Preface



I wish to welcome all of you to the International Symposium on High Perfor- 
mance Computing 2000 (ISHPC 2000) in the megalopolis of Tokyo. After having 
two great successes with ISHPC’97 (Fukuoka, November 1997) and ISHPC’99 
(Kyoto, May 1999), many people requested that the symposium be
held in the capital of Japan, and we have agreed.

I am very pleased to serve as Conference Chair at a time when high per- 
formance computing (HPC) has a significant influence on computer science and 
technology. In particular, HPC has had and will continue to have a significant im- 
pact on the advanced technologies of the “IT” revolution. The many conferences 
and symposiums that are held on the subject around the world are an indication 
of the importance of this area and the interest of the research community. 

One of the goals of this symposium is to provide a forum for the discussion 
of all aspects of HPC (from system architecture to real applications) in a more 
informal and personal fashion. Today we are delighted to have this symposium, 
which includes excellent invited talks, tutorials and workshops, as well as high 
quality technical papers. 

In recent years, the goals, purpose and methodology of HPC have changed 
drastically. HPC with high-cost, high-power consumption and difficult-to-use
interfaces will no longer attract users. We should instead use what the IT revo- 
lution of the present and near future gives us: highly integrated processors and 
extremely fast internet. Mobile and wearable computing is already common- 
place and the combination with multimedia and various database applications 
is promising. Therefore we would like to treat HPC technologies as systems and 
applications for low-end users as well as conventional high-end users, where we 
can find a bigger market. In this symposium, we will discuss the direction of 
such HPC technologies with hardware, software and applications specialists. 

This symposium would not have been possible without the significant help 
of many people who devoted resources and time. I thank all of those who have 
worked diligently to make the ISHPC 2000 a great success. In particular I would 
like to thank the Organizing Chair, Masaru Kitsuregawa of the University of 
Tokyo, and all members of the organizing committee, who contributed very significantly
to the planning and organization of ISHPC 2000. I must also thank the
Program Chair, Mateo Valero of the Technical University of Catalunya, and the 
program committee members who assembled an excellent program comprising a 
very interesting collection of contributed papers from many countries. 

A last note of thanks goes to the Kao Foundation for Arts and Science, the 
Inoue Foundation for Science, the Telecommunications Advancement Foundation 
and Sumisho Electronics Co. Ltd for sponsoring the symposium. 



October 2000 



Hidehiko Tanaka 




Foreword 



The 3rd International Symposium on High Performance Computing (ISHPC
2000), held in Tokyo, Japan, 16-18 October 2000, was thoughtfully planned,
organized, and supported by the ISHPC Organizing Committee and collaborative
organizations.

The ISHPC 2000 Program consists of two keynote speeches, several invited 
talks, two workshops on OpenMP and Simulation-Visualization, a tutorial on 
OpenMP, and several technical sessions covering theoretical and applied re- 
search topics on high performance computing which are representative of the 
current research activities in industry and academia. Participants and contribu- 
tors to this symposium represent a cross section of the research community and 
major laboratories in this area, including the European Center for Parallelism 
of Barcelona of the Polytechnical University of Catalunya (UPC), the Center 
for Supercomputing Research and Development of the University of Illinois at 
Urbana-Champaign (UIUC), the Maui High Performance Computing Center, 
the Kansai Research Establishment of Japan Atomic Energy Research Institute, 
Japan Society for Simulation Technology, SIGARCH and SIGHPC of the Information
Processing Society of Japan, and the Society for Massively Parallel Processing.

All of us on the program committee wish to thank the authors who submitted 
papers to ISHPC 2000. We received 53 technical contributions from 17 countries. 
Each paper received at least three peer reviews and, based on the evaluation 
process, the program committee selected fifteen regular (12-page) papers. Since 
several additional papers received favorable reviews, the program committee 
recommended a poster session comprised of short papers. Sixteen contributions 
were selected as short (8-page) papers for presentation in the poster session and 
inclusion in the proceedings. 

The program committee also recommended two awards for regular papers: a 
distinguished paper award and a best student paper award. The distinguished 
paper award was given to “Processor Mechanisms for Software Shared Memory” 
by Nicholas Carter, and the best student paper award was given to “Limits of 
Task-Based Parallelism in Irregular Applications” by Barbara Kreaseck. 

ISHPC 2000 has collaborated closely with two workshops: the International 
Workshop on OpenMP: Experiences and Implementations (WOMPEI) organized 
by Eduard Ayguade of the Technical University of Catalunya, and the Interna- 
tional Workshop on Simulation and Visualization (IWSV) organized by Kat- 
sunobu Nishihara of Osaka University. Invitation-based submission was adopted 
by both workshops. The ISHPC 2000 program committee decided to include all 
papers of WOMPEI and IWSV in the proceedings of ISHPC 2000. 







We hope that the final program will be of significant interest to the partici- 
pants and will serve as a launching pad for interaction and debate on technical 
issues among the attendees. 



Foreword from WOMPEI 

First of all, we would like to thank the ISHPC Organizing Committee for giv- 
ing us the opportunity to organize WOMPEI as part of the symposium. The 
workshop consists of one invited talk and eight contributed papers (four from 
Japan, two from the United States and two from Europe). They report some of 
the current research and development activities related to tools and compilers 
for OpenMP, as well as experiences in the use of the language. The workshop 
includes a panel discussion (shared with ISHPC) on Programming Models for 
New Architectures. We would also like to thank the Program Committee and 
the OpenMP ARB for their support in this initiative. Finally, thanks go to the 
Real World Computing Partnership for the financial support to WOMPEI. We 
hope that the program will be of interest to the OpenMP community and will 
serve as a forum for discussion on technical and practical issues related to the 
current specification. 



E. Ayguade (Technical University of Catalunya), 
H. Kasahara (Waseda University) and 
M. Sato (Real World Computing Partnership) 



Foreword from IWSV 

Recent rapid and incredible improvement of HPC technologies has encouraged 
numerical computation users to use larger and therefore more practical simula- 
tions. The problem such high-end users face is how to analyze or even understand 
the results calculated with huge computation times. The promising solution to 
this problem is the use of visualization. 

IWSV is organized as part of ISHPC 2000 and consists of 11 contributed
papers and abstracts. We would like to thank the ISHPC 2000 Organizing Com- 
mittee for providing us with this opportunity. We would also like to thank the 
ISHPC 2000 Program Committee for having IWSV papers and abstracts in- 
cluded in the proceedings, which we did not expect. 

We hope that IWSV will be of fruitful interest to ISHPC 2000 participants 
and will indicate a future direction of collaboration between numerical compu- 
tation and visualization researchers. 
Instruction Level Distributed Processing:
Adapting to Future Technology 



J. E. Smith 

Dept. of Elect. and Comp. Engr.
1415 Johnson Drive 
Univ. of Wisconsin 
Madison, WI 53706 



1. Introduction 

For the past two decades, the emphasis in processor microarchitecture has been on 
instruction level parallelism (ILP) — or in increasing performance by increasing the 
number of "instructions per cycle". In striving for higher ILP, there has been an 
ongoing evolution from pipelining to superscalar, with researchers pushing toward 
increasingly wide superscalar. Emphasis has been placed on wider instruction fetch, 
higher instruction issue rates, larger instruction windows, and increasing use of prediction
and speculation. This trend has led to very complex, hardware-intensive
processors. 

This trend is based on "exploiting" technology improvements. The ever-increasing 
transistor budgets have left researchers with the view that the big challenge is to con- 
sume transistors in some fashion, i.e. "how are we going to use a billion transistors?" 
[1]. Starting from this viewpoint, it is not surprising that the result is hardware- 
intensive and complex. Furthermore, the complexity is not just critical path lengths 
and transistor counts; there is also high intellectual complexity — from attempts to 
squeeze performance out of second and third order effects. 

Historically, computer architecture innovation has done more than exploit technol- 
ogy; it has also been used to accommodate technology shifts. A good example is 
cache memories. In the late 1970s, RAM cycle times were as fast as microprocessor
cycle times. If main memory can be accessed in a single cycle, there is no need for a 
cache. However, over the years, RAM speeds have not kept up with processor speeds 
and increasingly complex cache hierarchies have filled the gap. Architecture innovation
was used to avoid tremendous slowdowns by adapting to the shift in
processor/RAM technologies.

There are a number of significant technology shifts taking place right now. Wire
delays are coming to dominate transistor delays [2], and static power will soon catch up
with and pass dynamic power in importance [3]. Fast transistors will no longer be "free",
at least in terms of power consumption. There are also shifts in applications,
toward commercial memory-based applications, object oriented dynamic linking, and
multiple threads.

2. Implications: Instruction Level Distributed Processing 

Given the above, the next trend in microarchitecture will likely be Instruction Level 
Distributed Processing (ILDP). The processor will consist of a number of distributed 
functional units, each fairly simple with a very high frequency clock cycle. There 
will likely be multiple clock domains. Global interconnections will be point-to-point
with delays of a clock cycle or greater. Partitioning the system to accommodate these 
delays will be a significant part of the microarchitecture design effort. There may be 
relatively little low-level speculation (to keep the transistor counts low and the clock 
frequency high) — determinism is inherently simpler than prediction and recovery. 

2.1. Dependence-based Microarchitecture 

One type of ILDP processor consists of clustered "dependence-based" microarchitectures
[4]. The 21264 [5] is a commercial example, but a very early and little
known example was an uncompleted Cray-2 design [6]. In these microarchitectures,
processing units are organized into clusters and dependent instructions are steered to 
the same cluster for processing. 

In the 21264 microarchitecture there are two clusters, with some instructions routed to
each at issue time. Results produced in one cluster require an additional clock cycle
to be routed to the other. In the 21264, data dependences tend to steer communicating
instructions to the same cluster. Although there is additional inter-cluster delay,
the faster clock cycle compensates for the delay and leads to higher overall
performance.

In general, a dependence-based design may be divided into several clusters: cache
processing can be separated from instruction processing, integer processing can be
separated from floating point, etc. (see Fig. 1). In a dependence-based design,
dependent instructions are collected together, so instruction control logic within a
cluster is likely to be simplified, because there is no need to look for independence if
it is known not to exist.
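
The steering idea described above can be illustrated with a small sketch. It is not the 21264's actual issue logic; the `Instr` record, the `steer` function, and the tie-breaking rule are assumptions made purely for illustration. Each instruction follows the cluster that produced its source operands, and an instruction with no in-flight producer starts a new chain on the least-loaded cluster.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str            # register written by this instruction
    srcs: tuple = ()     # registers read by this instruction


def steer(instrs, num_clusters=2):
    """Assign each instruction to a cluster.

    Returns (assignment, crossings), where crossings counts source operands
    that must be forwarded from a different cluster.
    """
    producer_cluster = {}          # register -> cluster that last wrote it
    load = [0] * num_clusters      # instructions assigned to each cluster
    assignment = []
    crossings = 0
    for ins in instrs:
        # Clusters that hold the producers of this instruction's sources.
        homes = [producer_cluster[r] for r in ins.srcs if r in producer_cluster]
        if homes:
            # Follow the dependence: prefer the cluster holding most sources,
            # breaking ties toward the less loaded (then lower-numbered) one.
            cluster = max(sorted(set(homes)),
                          key=lambda c: (homes.count(c), -load[c]))
        else:
            # No known producer: start a new chain on the least-loaded cluster.
            cluster = min(range(num_clusters), key=lambda c: load[c])
        crossings += sum(1 for h in homes if h != cluster)
        producer_cluster[ins.dest] = cluster
        load[cluster] += 1
        assignment.append(cluster)
    return assignment, crossings


# Two independent chains (r1 -> r3, r2 -> r4) that finally merge in r5.
example = [Instr("r1"), Instr("r2"), Instr("r3", ("r1",)),
           Instr("r4", ("r2",)), Instr("r5", ("r3", "r4"))]
assignment, crossings = steer(example)
```

In this example each chain stays within its own cluster, and only the final merging instruction needs an operand forwarded across clusters, which is where the extra inter-cluster cycle would be paid.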

2.2. Heterogeneous ILDP 

Another model for ILDP is heterogeneous processors where a simple core pipeline is 
surrounded by outlying "helper engines" (Fig. 2). These helper engines are not in the
critical processing path, so they have non-critical communication delays with respect
to the main pipeline, and may even use slower transistors.

Fig. 1. A distributed data-dependent microarchitecture.

Examples of helper engines include the pre-load engine of Roth and Sohi [7] where 
pointer chasing can be performed by a special processing unit. Another is the branch 
engine of Reinman et al. [8]. An even more advanced helper engine is the instruc- 
tion co-processor described by Chou and Shen [9]. Helper engines have also been 
proposed for garbage collection [10] and correctness checking [11]. 

3. Co-Designed Virtual Machines 

Providing important support for the ILDP paradigm is the trend toward dynamic 
optimizing software and virtual machine technologies. A co-designed virtual 
machine is a combination of hardware and software that implements an instruction 
set, the virtual ISA [12,13,14]. Part of this implementation is hardware, which
supports an implementation-specific instruction set (Implementation Instruction Set
Architecture, I-ISA). The other part of the implementation is software, which
translates the virtual instruction set architecture (V-ISA) to the I-ISA and which
provides the capability of dynamically re-optimizing a program. A co-designed VM is a
way of giving hardware implementors a layer of software. This software layer
liberates the hardware designer from supporting a legacy V-ISA purely in hardware.
It also provides greater flexibility in managing the resources that make up an ILDP
microarchitecture.

With ILDP, resources must be managed with a high level view. The distributed
processing elements must be coordinated, and instructions and data must be routed in
such a way that resource usage is balanced and communication delays (among dependent
instructions) are minimized, as with any distributed system. This could
demand high complexity hardware, if hardware alone were given responsibility. For
example, the hardware must be aware of some elements of program structure, such as
data and control flow. However, traditional window-based methods give hardware
only a restricted view of the program, and the hardware would have to re-construct
program structure information by viewing the instruction stream as it flows by.

Hence, in the co-designed VM paradigm, software is responsible for determining pro- 
gram structure, dynamically re-optimizing code, and making complex decisions 
regarding management of ILDP. Hardware implements the lower level performance 
features that are managed by software. Hardware also collects dynamic performance 
information and may trigger software when "unexpected" conditions occur. Besides 
performance, the VM can be used for managing resources to reduce power require- 
ments [13] and to implement fault tolerance. 
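
The division of labor between hardware and software can be made concrete with a toy model. The sketch below is only illustrative: the class name `CoDesignedVM`, the `hot_threshold` parameter, and the string-tagging "translation" are all hypothetical stand-ins, not part of any real co-designed VM. Software translates a V-ISA block on first use, a counter plays the role of hardware profiling, and crossing a threshold triggers software re-optimization.

```python
class CoDesignedVM:
    """Toy model: software translates V-ISA blocks and re-optimizes hot ones."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold   # hypothetical trigger point
        self.translation_cache = {}          # block id -> (level, I-ISA code)
        self.exec_counts = {}                # stands in for hardware counters

    def translate(self, v_code, level):
        # Stand-in for real V-ISA -> I-ISA translation: tag each operation
        # with the optimization level that produced it.
        return [f"{level}:{op}" for op in v_code]

    def run_block(self, block_id, v_code):
        if block_id not in self.translation_cache:
            # First encounter: quick, unoptimized translation by software.
            self.translation_cache[block_id] = (
                "basic", self.translate(v_code, "basic"))
        self.exec_counts[block_id] = self.exec_counts.get(block_id, 0) + 1
        level, _ = self.translation_cache[block_id]
        if level == "basic" and self.exec_counts[block_id] >= self.hot_threshold:
            # Counter crossed the threshold: hand control to software,
            # which installs a re-optimized translation.
            self.translation_cache[block_id] = (
                "optimized", self.translate(v_code, "optimized"))
        return self.translation_cache[block_id]


vm = CoDesignedVM(hot_threshold=3)
levels = [vm.run_block("b0", ["load", "add"])[0] for _ in range(4)]
```

With a threshold of three, the block runs twice from the basic translation and is re-optimized on its third execution. A real system would base the trigger on richer profile information than a single execution count.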

4. New Instruction Sets 

Thus far, the discussion has been about microarchitectures, but there are also instruc- 
tion set implications. Instruction sets should be optimized for ILDP. Using VM tech- 
nology enables new instruction sets at the implementation level. Features of new 
instruction sets should include focus on communication and dependence and 
emphasis on small, fast implementation structures, including caches and registers. 
For example, variable length instructions lead to smaller instruction footprints and 
smaller caches. 

Most recent instruction sets, including RISC instruction sets, and especially VLIW
instruction sets, have emphasized computation and independence. The view was that 
higher parallelism could be achieved by focusing on computation aspects of instruc- 
tion sets and on placing independent instructions in proximity either at compile time 
or during execution time. For ILDP, however, instruction sets should be targeted at 
communication and dependence. That is, communications should be easily expressed 
and dependent instructions should be placed in proximity, to reduce communication 
delays. For example, a stack-based instruction set is an ISA that
places the focus on communication and dependence. Dependent instructions communicate
via the stack top; hence, communication is naturally expressed. Furthermore,
stack-based ISAs tend to have small instruction footprints. Although stack
instruction sets may have disadvantages, e.g. they may lead to more memory operations,
they do illustrate the point that dependence can be expressed in a clear way via
the ISA.

Fig. 3. Supporting an instruction level distributed processor
with a co-designed Virtual Machine.
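
As a minimal illustration of dependence being expressed through the stack top, the following sketch evaluates a tiny stack program. The three-opcode instruction format (`push`, `add`, `mul`) is invented for this example and does not correspond to any particular stack ISA.

```python
def run_stack(program):
    """Evaluate a tiny stack program and return the final stack top.

    No instruction names a register: a consumer finds its operands on the
    stack top, where its producers have just left them.
    """
    stack = []
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    for instr in program:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b = stack.pop()      # most recently produced value
            a = stack.pop()      # value produced just before it
            stack.append(ops[instr[0]](a, b))
    return stack.pop()


# (2 + 3) * 4: every consumer directly follows its producers in the stream.
program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
result = run_stack(program)
```

Because each consumer sits next to its producers in the instruction stream, the dependence chain can be read directly from instruction order, with no register names to compare.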

5. Summary and Conclusions 

In summary, technology shifts are forcing shifts in microarchitectures. Instruction 
level distributed processing will be organized around small and fast core processors.
These microarchitectures will contain distributed resources. Helper processors may 
be distributed around the main processing elements. These processors can execute 
highly parallel tasks and can be built from slower transistors to restrict static power 
consumption. Hence, the emphasis may shift from processor architecture to chip 
architecture where distribution and interconnection of resources will be key. 

Virtual machines fit very nicely in this environment. In effect, hardware designers 
can be given a layer of software that can be used to coordinate the distributed 
hardware resources and perform dynamic optimization from a higher level perspective
than is available in hardware alone. Finally, it is once again time that we reconsider
instruction sets with the focus on communication and dependence. New instruction
sets are needed to mesh with ILDP implementations and they are enabled by the
VM paradigm which makes legacy compatibility unimportant at the I-ISA level.
Abstract. The emergence of semiconductor fabrication technology allowing
a tight coupling between high-density DRAM and CMOS logic on
the same chip has led to the important new class of Processor-in-Memory 
(PIM) architectures. Recent developments provide powerful parallel pro- 
cessing capabilities on the chip, exploiting the facility to load wide words 
in single memory accesses and supporting complex address manipulations 
in the memory. Furthermore, large arrays of PIMs can be arranged into 
massively parallel architectures. In this paper, we outline the salient fea- 
tures of PIM architectures and describe the design of an object-based pro- 
gramming and execution model centered on the notion of macroservers. 
While generally adhering to the conventional framework of object-based 
computation, macroservers provide special support for the efficient con- 
trol of program execution in a PIM array. This includes features for 
specifying the distribution and alignment of data in virtual object space, 
the binding of threads to data, and a future-based synchronization mech- 
anism. We provide a number of motivating examples and give a short 
overview of implementation considerations. 



1 Introduction 

“Processor in Memory” or PIM technology and architecture has emerged 
as one of the most important domains of parallel computer architecture 
research and development. It is being pursued as a means of accelerating 
conventional systems for array processing [22] and for manipulating irregular 
data structures [10]. It is being considered as a basis for scalable spaceborne 
computing [23], as smart memory to manage systems resources in a hybrid 
technology multithreaded architecture for ultra-scale computing [24], and 
most recently as the means for achieving Petaflops performance [14]. PIM 
exploits recent advances in semiconductor fabrication processes that enable
the integration of DRAM cell blocks and CMOS logic on the same chip. The 
benefit of PIM structures is that processing logic can have direct access to the 
memory block row buffers at an internal memory bandwidth on the order of 
100 Gbps yielding the potential performance of 10 Gips (32-bit operands) on a 
memory chip with a 16 Mbyte capacity. Because of the efficiencies derived from
staying on-chip, power consumption can be an order of magnitude lower than 
comparable performance with conventional microprocessor based systems. But 
the dramatic advances in performance will be derived from arrays of tightly 
coupled PIM chips in the hundreds or thousands, either alone, or in conjunction 
with external microprocessors. Such systems could deliver low Teraflops scale 
peak performance within the next couple of years at a cost of only a few million 
dollars (or less than $1M if mass produced) and possibly a Petaflops, at least 
for some applications, in five years. 

The challenge to realizing the extraordinary potential of arrays of PIM is 
not simply the interesting problem of the basic on-chip structure and processor 
architecture but also the methodology for coordinating the synthesis of as much 
as a million PIM processors to engage in concert in the solution of a single 
parallel application. A large PIM array is not simply another MPP; it is a new
balance of processing and memory in a new organization. Its local operation and 
global emergent behavior will be a direct reflection of a shared highly parallel 
system-wide model of computation that governs the execution and interactions 
of the PIM processors and chips. Such a computing paradigm must treat the 
semantic requirements of the whole system even as it derives its processing 
capabilities from the local mechanisms of the individual parts. A synergy of 
cooperating elements is to be accomplished through this shared execution model. 

PIM differs significantly from more common MPP structures in several key 
ways. The ratio of computation performance to associated memory capacity 
is much higher. Access bandwidth (to on-chip memory) is a hundred times 
greater. And latency is lower by a factor of two to four while logic clock 
speeds are approximately half that of the highest speed microprocessors. Like 
clusters, PIM favors data oriented computing where operations are scheduled 
and performed at the site of the data, and tasks are often moved from one 
PIM to another depending on where the argument data is rather than moving 
the data. PIM processor utilization is less important than memory bandwidth. 
A natural organization of computation on a PIM array is a binding of tasks 
and data segments logically to coincide with physical data allocation while 
making remote service requests where data is non-local. This is very similar to 
evolving practices for accomplishing tasks on the Web including the use of Java 
and encourages an object-oriented approach to managing the logical tasks and 
physical resources of the PIM array. 

This paper presents a strategy for relating the physical resources of next 
generation PIM arrays to the logical requirements of user defined applications. 
The strategy is embodied in an intermediate form of an execution model that 
provides the generalized abstractions of both local and global computation 
in a unified framework. The principal abstract entity of the proposed model 
is the macroserver, a distributed agent of state and action. It complements 
the concept of the microserver, a purely local agent [3]. This early work
explores one possible model that is object based in a manner highly suitable 
to PIM structures but of a sufficiently high level with task virtualization that 
aggregations of PIM nodes can be cooperatively applied to a segment of parallel 
computation without phase changes in representations (as would be found with 
OpenMP combined with MPI).

The next section describes PIM architectures including the likely direction 
of their evolution over the next one to three years. Then, in Sections 3 and 4, a 
description of macroservers, a PIM-oriented object-based distributed execution
model, is presented. Section 5 discusses the implications of this model for its
implementation on the PIM including those that may drive architecture advances.
The paper concludes with a summary of the model’s features, guided by a set 
of requirements, in Section 6, and an outlook to future work required to achieve 
the promise of this approach, in Section 7. 

2 Processor in Memory 

For more than a decade, research experiments have been conducted with 
semiconductor devices that merged both logic and static RAM cell blocks on 
the same chips. Even earlier, simple processors and small blocks of SRAM 
could be found on simple control processors for embedded applications and of 
course modern microprocessors include high speed SRAM on chip for level 1 
caches. But it was not until recently that industrial semiconductor fabrication 
processes made possible tightly coupled combinations of logic with DRAM cell 
blocks bringing relatively large memory capacities to PIM design. A host of 
research projects has been undertaken to explore the design and application 
space of PIM (many under DARPA sponsorship) culminating in the recent 
IBM announcement to build a Petaflops scale PIM array for the application of 
protein folding. 

The opportunity of PIM is primarily one of bandwidth. Typical memory 
parts access a row of memory from a memory block and then select a subsegment 
of the row of bits to be sent to a requesting processor through the external 
interface. While newer generations of memory chips are improving effective 
bandwidth, PIMs make possible immediate access to all the bits of a memory 
row acquired through the sense amps. Processing logic, placed at the row buffer, 
can operate on all the data read (typically 64 32-bit words) in a single memory 
access under favorable conditions. While a number of PIM proposals plan to use 
previously developed processor cores to be “dropped into” the die, PIM offers 
important opportunities for new processor architecture design that simplifies 
operation, lowers development cost and time, and greatly improves efficiency 
and performance over classical processor architecture. Many of the mechanisms 
incorporated in today’s processors are largely unnecessary in a PIM processor. 
At the same time, effective manipulation of the very wide words available on
the PIM implies the need for augmented instruction sets.
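
To make the wide-word idea concrete, the sketch below models a row buffer as 64 words of 32 bits, the typical figure quoted above, and applies one lane-wise addition across the whole row. The names `wide_add`, `WORDS_PER_ROW`, and `WORD_BITS` are assumptions for illustration; this is not the instruction set of any particular PIM design.

```python
WORDS_PER_ROW = 64                  # from the text: a typical row holds 64 words
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1         # keeps each lane within 32 bits


def wide_add(row_a, row_b):
    """Lane-wise 32-bit addition over two full rows, modeled as one operation."""
    assert len(row_a) == len(row_b) == WORDS_PER_ROW
    # Each lane wraps around independently, as a 32-bit adder would.
    return [(a + b) & MASK for a, b in zip(row_a, row_b)]


row_bits = WORDS_PER_ROW * WORD_BITS          # bits delivered by one row access
total = wide_add(list(range(WORDS_PER_ROW)), [1] * WORDS_PER_ROW)
```

Under these assumptions a single row access delivers 2048 bits, so one wide operation stands in for 64 conventional 32-bit additions.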

PIM chips include several major subsystems, some of them replicated as 
space is available: 

- memory blocks
- processor control
- wide ALU and data path/register set
- shared functional units
- external interfaces

Typically PIMs are organized into sets of memory block/processor pairs while 
sharing some larger functional units and the external interfaces among them [17]. 
Detailed design studies suggest that PIM processors comprise less than 20% of 
the available chip real estate while memory occupies more than
half of the total space. Approximately a third of the die area is used for external
I/O interface and control as well as shared functional units. This is an excellent 
ratio and emphasizes the value of optimizing for bandwidth utilization rather 
than processor throughput. An important advantage of the PIM approach is the 
ability to operate on all bits in a given row simultaneously. A new generation 
of very wide ALU and corresponding instruction sets exploit this high memory 
bandwidth to accomplish the equivalent of many conventional operations (e.g. 
32-bit integer) in a single cycle. An example of such a wide ALU is the ASAP ISA 
developed at the University of Notre Dame and used in such experimental PIM 
designs as Shamrock and MIND. Other fundamental advances over previous 
generation PIMs are also in development to provide unprecedented capability 
and applicability. Among the most important of these are on-PIM virtual to 
physical address translation, message driven computation, and multithreading. 

Virtual-to-Physical Address Translation Early PIM designs have been very sim- 
ple assuming a physically addressed memory and often a SIMD control struc- 
ture [9]. But such basic designs are limited in their applicability to a narrow 
range of problems. One requirement not satisfied by such designs is the ability 
to manipulate irregular data structures. This requires the handling of user vir- 
tual addresses embedded in the structure meta-data. PIM virtual to physical 
address translation is key to extending PIM into this more generalized domain. 
Translation Lookaside Buffers can be of some assistance but they are limited 
in scalability and may not be the best solution. Virtual address translation is 
also important for protection in the context of multitasking systems. Address 
translation mechanisms are being provided for both the USC DIVA chip and the 
HTMT MIND chip. As discussed later in Section 5, alternative approaches to 
PIM address translation have been developed that are both efficient and scalable 
including set associative and in-situ techniques. 
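
A generic set-associative translation table can be sketched as follows. This is a textbook-style illustration with least-recently-used replacement, not the mechanism of the DIVA or MIND chips; the class name, the default sizes, and the replacement policy are all assumptions.

```python
class SetAssocTranslator:
    """Set-associative virtual-to-physical page translation (illustrative)."""

    def __init__(self, num_sets=4, ways=2, page_bits=12):
        self.num_sets = num_sets
        self.ways = ways
        self.page_bits = page_bits
        # Each set holds up to `ways` (vpn, ppn) pairs, most recently used last.
        self.sets = [[] for _ in range(num_sets)]

    def insert(self, vpn, ppn):
        entries = self.sets[vpn % self.num_sets]
        entries[:] = [e for e in entries if e[0] != vpn]   # drop any stale mapping
        if len(entries) == self.ways:
            entries.pop(0)                                 # evict least recently used
        entries.append((vpn, ppn))

    def translate(self, vaddr):
        vpn = vaddr >> self.page_bits
        offset = vaddr & ((1 << self.page_bits) - 1)
        entries = self.sets[vpn % self.num_sets]
        for i, (v, p) in enumerate(entries):
            if v == vpn:
                entries.append(entries.pop(i))             # mark most recently used
                return (p << self.page_bits) | offset
        return None    # miss: a real design would fall back to a slower path


tlb = SetAssocTranslator()
tlb.insert(1, 0x10)
paddr = tlb.translate(0x1ABC)    # virtual page 1, offset 0xABC
```

Only the `ways` entries of one set are searched per lookup, so lookup cost does not grow with the total number of mappings. The price is that pages mapping to the same set can evict one another.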

Message-Driven Computation A second important advance for PIM archi- 
tecture is message driven computation. Like simple memories, PIMs acquire 
external requests to access and manipulate the contents of memory cells. Unlike 
simple memories, PIMs may have to perform complex sequences of operations
on the contents of memory defined by user application or supervisor service 
routines. Mechanisms are necessary that provide efficient response to complex 
requests while maintaining generality. Message driven computation assumes a 
sophisticated protocol and on-chip fast interpretation mechanisms that quickly 
identify both the operation sequence to be performed and the data rows upon 
which to be operated. A general message driven low-level infrastructure goes 
beyond interactions between system processors and the incorporated PIMs; it
permits direct PIM to PIM interactions and control without system processor 
intervention. This reduces the impact of the system processors as a bottleneck 
and allows the PIMs to exploit data level parallelism at the fine grain level 
intrinsic to pointer linked sparse and irregular data structures. Both the USC 
DIVA chip and the HTMT MIND chip will incorporate “parcel” message driven 
computation while the IBM Blue Gene chip will permit direct PIM to PIM 
communications as well. 
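
The flavor of message-driven computation can be conveyed with a small simulation. The `Parcel` fields, the `PimArray` class, and the two actions below are hypothetical and far simpler than a real parcel protocol. Each message names a target address and an action, the node that owns the address interprets it, and a handler may itself emit a follow-on parcel to another node.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Parcel:
    addr: int          # global address of the target data
    action: str        # name of the operation to run where the data lives
    operand: int = 0   # immediate value, or a destination address for copy_to


class PimArray:
    def __init__(self, num_nodes, words_per_node):
        self.words_per_node = words_per_node
        self.memory = [[0] * words_per_node for _ in range(num_nodes)]
        self.queue = deque()
        # Hypothetical action table; handlers may emit follow-on parcels.
        self.actions = {"add": self._add, "copy_to": self._copy_to}

    def _locate(self, addr):
        # Map a global address to (owning node, local offset).
        return divmod(addr, self.words_per_node)

    def _add(self, node, off, parcel):
        self.memory[node][off] += parcel.operand

    def _copy_to(self, node, off, parcel):
        # PIM-to-PIM step: forward the local value to another address
        # without involving any host processor.
        self.queue.append(Parcel(parcel.operand, "add", self.memory[node][off]))

    def send(self, parcel):
        self.queue.append(parcel)

    def run(self):
        handled = 0
        while self.queue:
            parcel = self.queue.popleft()
            node, off = self._locate(parcel.addr)
            self.actions[parcel.action](node, off, parcel)
            handled += 1
        return handled


array = PimArray(num_nodes=2, words_per_node=4)
array.send(Parcel(addr=1, action="add", operand=5))
array.send(Parcel(addr=1, action="copy_to", operand=6))
handled = array.run()
```

Two parcels are injected here, but three are handled: the `copy_to` action generates a third parcel that carries the value from node 0 to node 1 directly.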



Multithreading A third important advance is the incorporation of multithreading 
into the PIM processor architecture. Although counter intuitive, multithreading 
actually greatly simplifies processor design rather than further complicating it 
because it provides a uniform hardware methodology for dynamically managing 
physical processor resources and virtual application tasks. Multithreading is also 
important because it permits rapid response to incoming service requests with 
low overhead context switching and also enables overlapping of computation, 
communication, and memory access activities, thus achieving much higher 
utilization and efficiency of these important resources. Multithreading also 
provides some latency hiding to local shared functional units, on-chip memory 
(for other processor/memory nodes on the chip), and remote service requests to 
external chips. The IBM Blue Gene chip and the HTMT MIND chip both will 
incorporate multithreading. 

Advanced PIM structures like MIND, DIVA, and Blue Gene require a so- 
phisticated execution model that binds the actions of the independent proces- 
sor/memory pairs distributed throughout the PIM array into a single coherent 
parallel/distributed computation. Some major requirements for this execution 
model are the following: 

1. Features for structuring and managing the global name space. 

2. Control of object and data allocation, distribution, and alignment, with
special support for sparse and irregular structures, as well as dynamic load
balancing. 

3. A general thread model, as a basis for expressing a range of parallel and 
distributed execution strategies, with a facility for binding threads to data. 

4. Support for an efficient mapping of the model’s features to the underlying 
microserver/parcel mechanism and the operating system nucleus.

Additional requirements, which will not be further discussed in this paper,
include the logical interface to I/O, protection issues, recovery from failed tasks 
and exceptional conditions, and the development of an API for code specification. 

Added to this should be the desirable features of hierarchy of abstraction, 
encapsulation, and modularity as well as simplicity and uniformity. The dis- 
tributed execution model must employ as its basis underlying mechanisms that 
can be implemented efficiently while retaining substantial generality and exten- 
sibility. The following model in its current inchoate state addresses most of these 
requirements. 

3 Macroservers: A Brief Overview 

We begin our concrete discussion of the macroserver model by providing a 
brief overview of the key concepts - macroserver classes and objects, state 
variables, methods and threads. Along the way we touch on the relationship with
the microserver model introduced by work at the University of Notre Dame 
and in the DIVA project [3,20]. 

A macroserver is an object that comes into existence by being created as 
an instantiation of a parameterized template called a macroserver class. Such 
a class contains declarations of variables and a set of methods defining its 
“behavior”. While the hardware architecture provides a shared address space, 
the discipline imposed by the object-based framework requires all accesses to 
external data to be performed via method calls, optionally controlled through 
a set of access privileges. At the time a macroserver is created, a region in the 
virtual PIM array memory is allocated to the new object. This allocation can 
be explicitly controlled in the model, by either directly specifying the region 
or aligning the object with an already existing one. A reference to the created 
object can be assigned to a new type of variable, called a macroserver variable,
which can act as a handle to the object. 

At any point in time, a macroserver (object) is associated with a state 
space in which a set of asynchronous threads is operating, each of which is
the result of the spawning of a method. The data structures of a macroserver 
can be distributed across the memory region allocated to it. We provide 
explicit functionality for specifying the initial distribution of data and their 
incremental redistribution depending on dynamically arising conditions. While 
the basic ideas of this feature originate from data parallel languages [5,13], 
we have generalized this concept to include arbitrary distribution functions 
and to apply to general data structures such as those provided by LISP 
lists. Furthermore, the model offers functionality for controlling the loca- 
tion in memory where a thread is to be executed. Such bindings can be 
established dynamically and are particularly important for linking threads 
to data on which they operate as well as for dealing with irregular computations.

Threads are lightweight; they execute asynchronously as long as not subject
to synchronization. Mutual exclusion can be controlled via atomic methods. A 
macroserver whose methods are all atomic is a monitor and can be used as a 
flexible instrument for scheduling access to resources. A “small” monitor can be 
associated with each element of a large data structure (such as a reservation 
system), co-allocating the set of variables required by the monitor with the 
associated element. This provides the basis for performing the scheduling in a 
highly efficient way in the ASAP. 
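
To make the idea of one small monitor per data element more concrete, here is a minimal Python sketch of a readers/writers scheduler attached to each record of a toy reservation table. The method names begin_read, end_read, begin_write, and end_write follow the scheduler example of Fig. 3; the bodies of begin_write and end_write are not given in the paper, so the versions below are assumptions, and the reader wait condition is simplified because Python condition variables offer no EMPTY predicate. Co-location of monitor and record in PIM memory has no counterpart here.

```python
import threading


class SchedulerTemplate:
    """A monitor: every method runs under the same lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.c_r = threading.Condition(self._lock)  # readers wait here
        self.c_w = threading.Condition(self._lock)  # writers wait here
        self.wr = 0  # number of active readers
        self.ww = 0  # number of active writers (0 or 1)

    def begin_read(self):
        with self._lock:
            while self.ww > 0:
                self.c_r.wait()
            self.wr += 1

    def end_read(self):
        with self._lock:
            self.wr -= 1
            if self.wr == 0:
                self.c_w.notify()

    def begin_write(self):  # assumed body, elided in the paper
        with self._lock:
            while self.ww > 0 or self.wr > 0:
                self.c_w.wait()
            self.ww = 1

    def end_write(self):  # assumed body, elided in the paper
        with self._lock:
            self.ww = 0
            self.c_w.notify()
            self.c_r.notify_all()


def book_seats(flight, n):
    """Writer thread: each update is bracketed by begin_write/end_write."""
    for _ in range(n):
        flight["scheduler"].begin_write()
        flight["seats"] += 1
        flight["scheduler"].end_write()


def run_demo(writers=8, per_writer=200):
    # One monitor per flight record, as suggested in the text.
    flights = [{"seats": 0, "scheduler": SchedulerTemplate()} for _ in range(3)]
    threads = [threading.Thread(target=book_seats, args=(flights[1], per_writer))
               for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [f["seats"] for f in flights]
```

Because every record carries its own monitor, writers to different flights never contend with each other; only accesses to the same record are serialized.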

Threads can be synchronized using condition variables or futures. Condition 
variables [12] provide a simple and efficient low-level mechanism allowing 
threads to wait for synchronization conditions or signal their validity. Future 
variables, which are related to the futures in Multilisp [11], can be bound to 
threads and used for implicit or explicit synchronization based upon the status 
of the thread and a potential value yielded by it. 
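
The two uses of future variables can be approximated with Python's standard concurrent.futures module. This is only an analogy under stated assumptions: the function square and the pool size are made up for the sketch, and Python has no truly implicit synchronization, so the call to result() stands in for a future appearing in an expression that needs its value.

```python
from concurrent.futures import ALL_COMPLETED, ThreadPoolExecutor, wait


def square(x):
    return x * x


def demo():
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Spawning a thread and binding a future to it.
        futures = [pool.submit(square, i) for i in range(5)]
        # Explicit synchronization: wait on a condition over several futures.
        done, not_done = wait(futures, return_when=ALL_COMPLETED)
        # Value access: result() blocks until the thread has yielded a value.
        total = sum(f.result() for f in futures)
    return len(done), len(not_done), total
```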

Figure 1 uses a code fragment for a producer/consumer problem with 
bounded buffer to illustrate some of the general features of the model. We de- 
scribe a macroserver class, buffer_template, which is parameterized with an in- 
teger size. The class contains declarations for a data array fifo - the buffer data 
structure -, a number of related auxiliary variables, and two condition variables. 
Three methods - put, get, and get_count - are introduced, the first two of them
being declared atomic. We show how a macroserver object can be created from 
that class, and how to obtain access to the associated methods. 
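
For readers more familiar with conventional thread libraries, the following Python sketch shows one way the bounded buffer of Figure 1 could behave. The method bodies are elided in the figure, so everything inside put and get is an assumption, as is the choice of which condition variable producers and consumers wait on. A shared lock plays the role of the ATOMIC attribute; allocation in a PIM memory region is not modeled.

```python
import threading
from collections import deque


class BufferTemplate:
    """Bounded FIFO buffer; put and get exclude each other ("atomic")."""

    def __init__(self, size):
        self.size = size
        self.fifo = deque()
        self._lock = threading.Lock()
        # Two condition variables on one lock, mirroring c_empty and c_full.
        self.c_empty = threading.Condition(self._lock)  # consumers wait here
        self.c_full = threading.Condition(self._lock)   # producers wait here

    def put(self, x):
        with self._lock:
            while len(self.fifo) == self.size:
                self.c_full.wait()
            self.fifo.append(x)
            self.c_empty.notify()

    def get(self):
        with self._lock:
            while not self.fifo:
                self.c_empty.wait()
            x = self.fifo.popleft()
            self.c_full.notify()
            return x

    def get_count(self):
        # Not declared atomic in Figure 1 either: a plain read.
        return len(self.fifo)


def run_demo(n_items=100, size=4):
    buf = BufferTemplate(size)
    received = []

    def produce():
        for i in range(n_items):
            buf.put(float(i))

    def consume():
        for _ in range(n_items):
            received.append(buf.get())

    producer = threading.Thread(target=produce)
    consumer = threading.Thread(target=consume)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return received, buf.get_count()
```

With a single producer and a single consumer the items arrive in the order in which they were put, and the buffer is empty once both threads have finished.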

Our notation is based on Fortran 95 [8] and ad-hoc syntax extensions mainly 
motivated by HPF [13] and Opus [7]. Note that we use this notation only as 
a means for communication in this paper, without any intent of specifying a 
concrete language syntax, but rather focusing on semantic constructs which can 
be embedded into a range of programming languages. 



4 Key Features of the Model 

In this section, we outline those features of the model that were specifically 
designed to support the efficient execution of parallel programs on massively 
parallel PIM arrays. We focus on the following topics, which are discussed in the 
subsections below. 

- control of object allocation

- distribution and alignment of data

- thread model

- synchronization



4.1 Object Allocation 

MACROSERVER CLASS buffer_template(size) ! declaration of the
                                        ! macroserver class buffer_template
  INTEGER :: size ! declaration of the class parameter

  ! Declarations of the class variables:
  REAL :: fifo(0:size-1)
  INTEGER :: count = 0
  INTEGER :: px=0, cx=0
  CONDITION :: c_empty, c_full

CONTAINS ! Here follow the method declarations

  ATOMIC METHOD put(x) ! put an element into the buffer
    REAL :: x

  END

  ATOMIC REAL METHOD get() ! get an element from the buffer

  END

  INTEGER METHOD get_count() ! get the value of count

  END

END MACROSERVER CLASS buffer_template

! Main program:

INTEGER buffersize
MACROSERVER (buffer_template) my_buffer ! declaration of the macroserver
                                        ! variable my_buffer
READ (buffersize)

my_buffer = CREATE (buffer_template, buffersize) IN M(region) ! Creation of a
! macroserver object as an instance of class buffer_template, in a region of
! virtual memory. A reference to that object is assigned to my_buffer.

! A producer thread putting an item in the buffer:

CALL my_buffer%put(...) ! Synchronous call of the method put in the macroserver
                        ! object associated with my_buffer

Fig. 1. Skeleton for a producer/consumer problem

Some next-generation PIM architectures such as the MIND chip under development
incorporate multithreading mechanisms for very low overhead task context
switching. A multithreaded architecture such as the TERA MTA [4] employs
many concurrent threads to hide memory access latency. In contrast,
multithreading in PIMs is employed primarily as a resource management mechanism
supporting dynamic parallel resource control with the objective of overlapping
local computation and communication to maximize effective memory bandwidth
and external I/O throughput. While the effects of latency are mitigated to some
degree by the use of a small number of threads, latency in PIMs is primarily
addressed by the tight coupling of a wide ALU to the memory row buffers, by
alignment of data placement, and by directed computation at the site of the
argument data.

Moreover, due to the relatively small size of the memory on a single PIM 
chip, large data structures must be distributed. As a consequence, our model 
provides features that allow the explicit control of the allocation of objects in 
virtual memory, either directly or relative to other objects, and the distribution 
of data in object space. 

Consider the creation of a new macroserver object. The size and structure 
of the data belonging to this object and the properties of the threads to be 
operating on these data may imply constraints regarding the size of the memory 
region to be allocated and an alignment with other, already allocated objects. 
Such constraints can be specified in the create statement. An example illustrating 
an explicit assignment of a region in virtual PIM memory has been shown in 
Figure 1. A variant of this construct would allocate the new object in the same
region as an already existing object:

my_buffer = CREATE (buffer_template, ...) ALIGN (my_other_buffer)

One PIM node in the memory region allocated to a macroserver is distinguished
as its home. This denotes a central location where the “metadata”
needed for the manipulation of the object are stored, exploiting the ability of
the ASAP to deal with compact data structures that fit into a wide word in a
particularly efficient way [footnote 1].

Footnote 1: For complex data structures the metadata will usually be organized
in a hierarchical manner rather than with the simple structure suggested here.

4.2 Distribution and Alignment of Data

A variable that is used in a computation must be allocated in the region allocated
to its macroserver. We call the function describing the mapping from the basic
constituents of the variable to locations in the PIM region the distribution of the
variable. Macroservers allow explicit control of distribution and alignment.

Variable distributions have been defined in a number of programming language
extensions, mostly in the context of SIMD architectures and data parallel
languages targeted towards distributed-memory multiprocessors (DMMPs).
Examples include Vienna Fortran, HPF, and HPF+ [5,13,6]. We generalize these
approaches as discussed below. Some of these ideas have also been proposed for
actor languages [21,16].

- Our approach applies not only to the traditional “flat” Fortran data
  structures such as multidimensional arrays but also covers arbitrary list
  structures such as those in LISP. This extension is essential since the
  hardware-supported microserver mechanism for PIMs allows highly efficient
  processing of variant-structure irregular data, as outlined in the “list
  crawler” example discussed in [3].

- We generalize the distribution functions proposed in Vienna Fortran to
  “distribution objects”, which allow arbitrary mappings of data components to a
  PIM memory region, and in addition provide special support for sparse matrix
  representations. Such mappings can be performed dynamically, and they
  may be established in a partitioning tool which is linked to the macroserver.
  This generalization is crucial for the processing of irregular problems.

- More formally, we associate each variable $v$ with an index domain, $I$, which
  serves to provide a unique “name” for each component of $v$. For example,
  if $v$ is a simple variable such as logical, integer, or real, then we choose for
  the index domain the singleton set $I = \{1\}$. If $v = A(1:n, 1:m)$ is a
  two-dimensional array, then we can define $I = [1:n] \times [1:m]$. Similarly, we
  can use strings of the form $i_1.i_2.\cdots.i_k$, where all $i_j$ are between 1
  and $n$, to access a leaf of an $n$-ary tree of height $k$.

  For fixed-structure variables (such as those in Pascal, Fortran, or C which
  do not involve pointers), the index domain can be determined at the time
  of allocation, and is invariant thereafter. Otherwise, such as for LISP data
  structures, the index domain changes according to the incremental
  modifications of the data structure.

  Based on index domains, we can define the distribution of a variable $v$ with
  index domain $I$ as a total function $\delta^v : I \to \mathcal{R}$, where
  $\mathcal{R}$ denotes the region in PIM memory allocated to the macroserver to
  which the variable belongs. Here, we disregard replication for reasons of
  simplicity. Given $v$, $\delta^v$, and a particular memory region,
  $R \subset \mathcal{R}$, the set of elements of $v$ mapped to $R$ is called the
  distribution segment of $R$: $\{i \in I \mid \delta^v(i) \in R\}$. $R$ is called
  the home of all elements of $v$ in the distribution segment.

- The distribution mechanism is complemented by a simple facility for the
  alignment of data structures in PIM memory. Distribution and alignment
  are designed as fully dynamic mechanisms, allowing redistribution and
  realignment at execution time, depending on decisions made at runtime.
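
The notions of index domain, distribution, and distribution segment can be tried out in a few lines of Python. The sketch below assumes a two-dimensional array distributed in square blocks over PIM nodes that are numbered row by row; the function names and this node numbering are illustrative choices, not part of the model. Any other total function from indices to nodes could be substituted for delta, which is the point of generalizing to arbitrary distributions.

```python
from itertools import product


def index_domain(n, m):
    """Index domain I = [1:n] x [1:m] of a two-dimensional array A(1:n, 1:m)."""
    return list(product(range(1, n + 1), range(1, m + 1)))


def block_distribution(block, nodes_per_row):
    """Return a total function delta: I -> node id for block x block tiles."""
    def delta(index):
        i, j = index
        return ((i - 1) // block) * nodes_per_row + (j - 1) // block
    return delta


def distribution_segment(domain, delta, region):
    """All indices i in I whose home delta(i) lies in the sub-region R."""
    return [i for i in domain if delta(i) in region]
```

For a 4 by 4 array in 2 by 2 blocks on four nodes, distribution_segment(index_domain(4, 4), block_distribution(2, 2), {3}) yields the four indices of the lower right block.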



4.3 Threads 

At any time, zero or more threads may exist in a given macroserver; different 
threads - within one or different macroservers - may execute asynchronously in 
parallel unless subject to mutual exclusion or synchronization constraints. All 
threads belonging to a macroserver are considered peers, having the same rights 
and sharing all resources allocated to the macroserver.

At the time of thread creation, a future variable [11] may be bound to the
thread. Such a variable can be used to make inquiries about the status of the 
thread, retrieve its attributes, synchronize the thread with other threads, and 
access its value after termination. 

In contrast to UNIX processes, threads are lightweight, living in user 
space. Depending on the actual method a thread is executing, it may be ultra 
lightweight, carrying its context entirely in registers. On the other hand, a thread 
may be a significant computation, such as a sparse matrix vector multiply, 
with many levels of parallel subthreads. The model provides a range of thread 
attributes that allows a classification of threads according to their weight and
other characteristics. Examples include properties such as “non-preemptive” 
and “detached” [19]. 

Threads can be either spawned individually, or as members of a group. 
The generality and simplicity of the basic threading model allows in principle 
the dynamic construction of arbitrarily linked structures; it can also be used
to establish special disciplines under which a group of threads may operate. 
An important special case is a set of threads that cooperate according to the 
Single-Program-Multiple-Data (SPMD) execution model that was originally 
developed for DMMPs. The SPMD model is a paradigm supporting “loosely 
synchronous” parallel computation which we perceive as one of the important 
disciplines for programming massively parallel PIM arrays. For example, linear 
algebra operations on large distributed dense or sparse data structures are 
particularly suitable for this paradigm and can be implemented in a similar way 
as for DMMPs. However, the efficient data management facilities in the ASAP 
and the ability to resolve indirect accesses directly in the memory allow a more 
efficient implementation for PIM arrays. 

At the time a thread is created it can be bound to a specific location 
in the memory region associated with its macroserver. Usually that location 
is not specified explicitly but rather indirectly, referring to the home of a 
data structure on which the thread is to operate. This facility of binding 
computations to data is orthogonal to the variable distribution functionality, 
but can be easily used in conjunction with it. 

We use a variant of the Fortran 95 forall statement to indicate the parallel 
creation of a set of threads all executing the same method. Consider the following 
simple example: 

FORALL THREADS (I=1:100, J=1:100, ON HOME (A(I,J)))

F(I,J) = SPAWN (intra_block_transpose,I,J)

This statement has the following effect: 

- 10000 threads, say t(I, J), $1 \le I, J \le 100$, are created in parallel.

- Each thread t(I, J) activates method intra_block_transpose with arguments I
  and J.

- For each I and J, thread t(I, J) is executed in the ASAP which is the home
  of array element A(I, J).

- For each I and J, the future variable F(I, J) is assigned a reference to t(I, J).

The above example is a simplified code fragment for a parallel algorithm that 
performs the transposition of a block-distributed two-dimensional matrix in two 
steps, first transposing whole blocks, and then performing an intra-block
transposition [18]. In the code fragment of Fig. 2, matrix A is dynamically allocated
in the PIM memory, depending on a runtime-determined size specification. The 
blocksize is chosen in such a way that each block fits completely into one wide 
word of the ASAP (8 by 8 elements). We make the simplifying assumption that 
8 divides the number of elements in one dimension of the matrix. 

We illustrate the second step of this algorithm. Parallelism is controlled by 
explicit creation and synchronization of threads. Each thread executing a call to
intra_block_transpose can be mapped to a permutation operation directly
supported in the ASAP hardware [3].
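
The two-step transposition can be sketched in Python as follows. A thread pool stands in for the spawned threads, and waiting on the returned futures plays the role of the barrier; binding each thread to the home of its block has no counterpart here. The block exchange in step 1 is only mentioned in the paper, so exchange_blocks is an assumption about how it might be done.

```python
from concurrent.futures import ThreadPoolExecutor, wait

BLOCK = 8  # block edge, following the 8 by 8 wide-word assumption in the text


def exchange_blocks(a, n):
    """Step 1: swap block (bi, bj) with block (bj, bi) for every bi < bj."""
    for bi in range(0, n, BLOCK):
        for bj in range(bi + BLOCK, n, BLOCK):
            for di in range(BLOCK):
                for dj in range(BLOCK):
                    a[bi + di][bj + dj], a[bj + di][bi + dj] = (
                        a[bj + di][bi + dj], a[bi + di][bj + dj])


def intra_block_transpose(a, ii, jj):
    """Step 2: transpose in place the block whose top-left corner is (ii, jj)."""
    for di in range(BLOCK):
        for dj in range(di + 1, BLOCK):
            a[ii + di][jj + dj], a[ii + dj][jj + di] = (
                a[ii + dj][jj + di], a[ii + di][jj + dj])


def transpose(a):
    n = len(a)
    assert n % BLOCK == 0, "the text assumes 8 divides the matrix dimension"
    exchange_blocks(a, n)
    with ThreadPoolExecutor() as pool:
        # One task per block; the blocks are disjoint, so tasks do not interfere.
        futures = [pool.submit(intra_block_transpose, a, ii, jj)
                   for ii in range(0, n, BLOCK) for jj in range(0, n, BLOCK)]
        wait(futures)  # barrier over all futures
    return a
```

Since every task touches only its own block, the second step needs no synchronization other than the final barrier.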

4.4 Synchronization 

Our model supports mutual exclusion and condition synchronization [2], in both
cases following a conventional low-level approach. Mutual exclusion guarantees
atomic access of threads to a given resource, leaving the order in which blocked
threads are released implementation specific. It can be expressed using atomic
methods similar to those in CC++ [15] or Java. Condition synchronization is
modeled after Hoare’s monitors [12]. Condition variables can be associated 
with programmed synchronization conditions; threads whose synchronization 
condition at a given point is not satisfied can suspend themselves with respect 
to a condition variable, waiting for other threads to change the state in such a 
way that the condition becomes true. 

The example code in Fig. 3 sketches the declaration of a monitor
scheduler_template that can be used for controlling the access of a set of reader
and writer threads to a data resource. Here, WAIT c blocks a thread with
respect to condition variable c, EMPTY c is a predicate that yields true iff
no thread is blocked with respect to c, and SIGNAL c releases a thread from
the queue associated with c, if it is not empty. Each reader access must be
enclosed by the calls begin_read and end_read, and analogously for writers. The
rest of the program illustrates how in a large data structure (in this case, a 
flight reservation system) each element can be associated with its own monitor; 
these monitors could be parameterized. Furthermore, the alignment clause 
attached to the create statement guarantees that the variables of each monitor 
are co-located with the associated element of the flight data base. This allows 
highly efficient processing of the synchronization in the ASAP.

PROGRAM MATRIX_TRANSPOSE

MACROSERVER CLASS t_class
  INTEGER :: N
  FUTURE, ALLOCATABLE :: F(:,:) ! future variable array
  REAL, ALLOCATABLE, DISTRIBUTE ( BLOCK(8), BLOCK(8) ) :: A(:,:)
CONTAINS

  METHOD initialize()

  METHOD intra_block_transpose(II,JJ)
    INTEGER :: II, JJ
    INTEGER :: I, J
    FORALL (I=II:II+7, J=JJ:JJ+7) A(I,J) = A(J,I)
  END intra_block_transpose

END MACROSERVER CLASS t_class

! Main Program

INTEGER :: II, JJ
MACROSERVER (t_class) my_transpose = CREATE (t_class)

CALL my_transpose%initialize()

! exchange whole blocks

! for each block, activate intra_block_transpose as a separate thread. The thread
! associated with (II, JJ) executes in the PIM storing the matrix block A(II,JJ).
FORALL THREADS (II=1:N-7:8, JJ=1:N-7:8, ON HOME (A(II,JJ)))
  F(II,JJ) = SPAWN (intra_block_transpose,II,JJ)

WAIT ( ALL (F)) ! barrier

END PROGRAM MATRIX_TRANSPOSE



Fig. 2. Matrix Transpose 



In addition to the synchronization primitives discussed above, our model 
proposes special support for synchronization based on future variables similar to 
those introduced in Multilisp [11]. This takes two forms, explicit and implicit.

Explicit synchronization can be formulated via a version of the wait statement 
that can be applied to a logical expression depending on futures. Wait can also 
be used in the context of a forall threads statement, as shown in the example of 
Figure 2. 

Implicit synchronization is automatically provided if a typed future variable 
occurs in an expression context that requires a value of that type.

MACROSERVER CLASS MONITOR scheduler_template
  INTEGER wr=0, ww=0
  CONDITION c_R, c_W
CONTAINS
  METHOD begin_read()
    DO WHILE ((ww GT 0) AND (NOT EMPTY (c_W))) WAIT c_R
    wr = wr + 1
  END begin_read

  METHOD end_read()
    wr = wr-1
    IF wr == 0 SIGNAL (c_W)
  END end_read

  METHOD begin_write()

  METHOD end_write()

END MACROSERVER CLASS scheduler_template

! Main program

INTEGER :: N, I, K

TYPE flight_record ! sketch of data structure for a flight record
  INTEGER :: date, time, flight_number, create_status
  MACROSERVER (scheduler_template) my_scheduler
END TYPE flight_record

TYPE (flight_record), ALLOCATABLE ,... ONTO M(region) :: flights(:)
ALLOCATE (flights(N))

DO I=1,N
  flights(I)%my_scheduler = CREATE (scheduler_template,create_status(I))
                            ALIGN (HOME (flights(I)))

END DO

! write access to element flight(K) in a thread:

K=...

CALL flights(K)%my_scheduler%begin_write()
! write
CALL flights(K)%my_scheduler%end_write()



Fig. 3. Fine-grain scheduling

5 Implications for Implementation

The semantic constructs of the execution model have been carefully selected 
both to provide a powerful framework for specifying and managing parallel 
computations and to permit efficient implementation of the primitive mecha- 
nisms comprising it. In this section, we discuss some of the most important 
implementation issues underlying the basic support for the execution model. 
Effective performance of the basic primitives ensures overall computational 
efficiency. 

Parcels are messages that convey both data identifiers and task specifiers 
that direct work at remote sites and determine a continuation path of execution 
upon completion. Similar to active messages, parcels direct all inter-PIM trans- 
actions. An arriving parcel may cause something as simple as a memory read 
operation to be performed or something as complex as a search through all PIM 
memory for a particular key field value. Parcel assimilation therefore becomes 
a critical factor in the effective performance obtainable from a given PIM. 
Parcels must be variable length such that commands for primitive operations 
can be acquired quickly with only the longer executing tasks requiring longer 
parcel packets. A single access can take as little as 40 nanoseconds. Assuming 
128-bit parcels for operation, address, and data (for writes in and read outs) 
and 1 Gbps unidirectional pins, read and/or write operations can be streamed 
using byte serial input and output parcel communications, sustaining peak 
memory bandwidth. More complex operations will demand somewhat longer 
packets but involve multiple memory accesses, thus overlapping communications 
with computation. Parcels can be stored in row-width registers and processed 
directly by the wide ALU of the PIM processor. 
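
The streaming claim can be checked with a short back-of-the-envelope calculation. The sketch below assumes that "byte serial" means eight data pins in each direction; the parcel size, pin rate, and access time are taken from the text, while the constant and function names are made up for the example.

```python
PARCEL_BITS = 128           # parcel size from the text
ACCESS_TIME_NS = 40.0       # fastest single memory access from the text
PIN_RATE_BITS_PER_NS = 1.0  # 1 Gbps per pin equals 1 bit per nanosecond
PINS_PER_DIRECTION = 8      # assumption: "byte serial" means 8 pins each way


def parcel_transfer_ns(bits=PARCEL_BITS, pins=PINS_PER_DIRECTION,
                       rate=PIN_RATE_BITS_PER_NS):
    """Time to move one parcel across the link in one direction."""
    return bits / (pins * rate)


def link_keeps_up(access_ns=ACCESS_TIME_NS):
    """True if parcels can arrive at least as fast as memory can serve them."""
    return parcel_transfer_ns() <= access_ns
```

Under these assumptions a parcel needs 16 ns on the link, well below the 40 ns access time, so the link does not limit a stream of simple reads or writes. A single 1 Gbps pin would need 128 ns per parcel and could not keep up.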

Address translation from virtual to physical may be made efficient through 
a combination of techniques. One uses set associative techniques similar to some 
cache schemes. A virtual page address is restricted to one or a small set of PIMs 
which contain the rest of the address translation through a local page table. 
Thus, any virtual page address reference can be directed to the correct PIM (or 
PIMs) and the detailed location within the PIM set is determined by a simple 
table look up. Associative scanning is performed efficiently because a row wide 
simple ALU allows comparisons of all bits or fields in a row simultaneously 
at memory cycle rates. These wide registers can be used as pseudo TLBs 
although they are in the register name space of the PIM processor ISA and 
therefore can be the target of general processing. Such registers can be used 
to temporarily hold the physical addresses of heavily used methods. Because 
the opcode and page offsets on a chip are relatively few, a single row register 
might hold 128 translation entries which could be checked in a single logic 
cycle of 10 nanoseconds, quite possibly covering all procedures stored on the 
PIM. A second technique to be employed is in-situ address translation. A 
complex data structure comprising a tree of pointer links includes both the 
virtual address of the pointer and the physical address. As a logical node in the
data structure is retrieved from memory, the wide row permits both physical
and virtual addresses to be fetched simultaneously. As page migration occurs 
relatively slowly, updating the contents of physical pointer fields can be done 
with acceptable overhead. 
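
The set associative scheme can be pictured with a small Python model. Here a virtual page may reside only on a small set of candidate PIMs, chosen by the page number, and each PIM keeps a local page table for the pages it actually holds. The page size, node count, associativity, and all names are assumptions made for the sketch; the row-wide associative compare of the hardware is replaced by an ordinary dictionary lookup.

```python
PAGE_SIZE = 4096  # assumed page size in bytes
NUM_PIMS = 16     # assumed number of PIM nodes
WAYS = 2          # assumed number of candidate PIMs per virtual page


class PimNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.page_table = {}  # local table: virtual page -> physical frame


def candidate_pims(vpage):
    """The small set of PIMs that is allowed to hold this virtual page."""
    set_index = vpage % (NUM_PIMS // WAYS)
    return [set_index * WAYS + way for way in range(WAYS)]


def translate(nodes, vaddr):
    """Return (node id, physical address on that node) for a virtual address."""
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    for node_id in candidate_pims(vpage):
        frame = nodes[node_id].page_table.get(vpage)
        if frame is not None:
            return node_id, frame * PAGE_SIZE + offset
    raise KeyError("virtual page %d is not mapped" % vpage)
```

Only the candidate set has to be consulted for any reference, so the cost of a translation does not grow with the number of PIMs in the array.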

The multithreading scheduling mechanism allows the processing of one 
parcel request on register data to be conducted while the memory access for 
another parcel is being undertaken and quite possibly one or more floating 
point operations are propagating through the shared pipelined units. The 
multithreading mechanism’s general fine grain resource management adapts 
to the random incidence of parcels and method execution requirements to 
provide efficient system operation. Each active thread is represented in a thread 
management register. A wide register is dedicated to each thread, holding the 
equivalent of 64 4-byte words.

Synchronization is very general as an array of compound atomic operations 
can be programmed as simple methods. A special lock is set for such operations 
such that thread context switches do not interfere with atomicity, ensuring cor- 
rectness. The future-based synchronization operations require more than simple 
bit toggling. The execution record of a thread bound to a future variable stores 
pointers to sources of access requests when the solicited value has yet to be 
produced. Upon completion of the necessary computation or value retrieval, the 
buffered pointer redirects the result value to the required locations. Thus a future 
involves a complex data structure and its creation, manipulation, and elimina- 
tion upon completion. This can be a very expensive operation using conventional 
processors and memories. But the PIM architecture with direct access to a 256- 
byte record (single row) allows single cycle (memory cycle) create, update, and 
reclaim operations making this form of synchronization very efficient. 
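
The bookkeeping behind a future can be sketched as a small record that buffers requesters until the value exists. In the Python below, a callback stands in for the pointer to the source of an access request; the class and method names are hypothetical, and nothing is implied about how the record is laid out in a memory row.

```python
class FutureRecord:
    """Execution record of a thread bound to a future variable."""

    _UNSET = object()  # marker for "value not yet produced"

    def __init__(self):
        self.value = FutureRecord._UNSET
        self.pending = []  # requesters that arrived before the value existed

    def request(self, deliver):
        """Ask for the value; deliver is called once the value is available."""
        if self.value is FutureRecord._UNSET:
            self.pending.append(deliver)  # buffer the requester
        else:
            deliver(self.value)           # value already present

    def complete(self, value):
        """Store the result and redirect it to all buffered requesters."""
        self.value = value
        waiting, self.pending = self.pending, []
        for deliver in waiting:
            deliver(value)
```

A request made before completion is answered at the moment complete is called; a request made afterwards is answered immediately.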

6 Revisiting the Requirements 

We started our discussion of the macroserver model by indicating four major 
requirements at the end of Section 2. The material in Sections 3 and 4 addressed
these issues point by point in the context of the description
of macroserver semantics. Here we summarize this discussion, guided by the set 
of requirements. 

Structuring and Managing the Name Space 

PIM arrays support a global name space. The macroserver model structures 
this space in a number of ways. First, macroserver classes encapsulate data and
methods, providing a “local” scope of naming. Second, each macroserver object 
establishes a dynamic naming scope for its instance of data and methods. Finally 
(this is a point not further mentioned in the paper), each macroserver has a set of 
“acquaintances” [1] and related access privileges, which allow a system designer 
to establish domains of protection.

Object and Data Allocation

Controlling the allocation, distribution, and alignment of objects and data is a 
key requirement for the model. Macroservers provide the following support for 
these features: 

- At the time of object creation, constraints for the allocation of the object
  representation may be specified. Such constraints include size of
  representation, explicit specification of an area in the PIM array, or
  alignment with an already existing object.

- A general mechanism for the distribution of data in the memory region
  allocated to an object has been developed. This includes arbitrary mappings
  from elements of a data structure to PIMs and provides special support for
  sparse arrays.

- Distributions may be determined at runtime, supporting dynamic allocation
  and re-allocation of data and providing support for dynamic load balancing
  strategies.

Thread Model 

Asynchronous threads are generated by the spawning of a method. At any given 
time, a set of lightweight threads, which can be activations of the same or of 
different methods, may operate in a macroserver. 

The “weight” of threads may cover a broad range, from ultra-lightweight
(with the context being kept completely in registers) to fairly heavy
computations such as a conjugate gradient algorithm.

Threads can be spawned as single entities or as elements of a group. The 
simplicity of the basic concept and its generality allow for a range of parallel 
execution strategies to be formulated on top of the basic model. 

An important aspect supporting locality of operation is a feature that allows 
the binding of threads to data: at the time of thread generation, a constraint for 
the locus of execution of this thread in PIM memory can be specified. 

Efficiency of Implementation 

A key requirement for an execution model is performance: the ability to map the 
elements of the model efficiently to the underlying PIM array. The macroserver 
model addresses this issue by providing a sophisticated set of control mecha- 
nisms (many of which are too low-level to be reflected in a high-level language 
specification). A more detailed discussion of implementation support issues was
conducted in the previous section. 

7 Conclusion 

This paper discussed macroservers, an object-based programming and execution 
model for Processor-in-Memory arrays. Based upon the major requirements as
dictated by the salient properties of PIM-based systems such as HTMT MIND,
the DIVA chip, and IBM’s Blue Gene, we worked out the key properties of 
the model addressing these requirements. A more detailed description of the 
macroserver model can be found in [26]; its application to irregular problems is 
dealt with in [27]. 

Future work will focus on the following tasks: 

— completing a full specification of the model 

— applying the model to a range of application problems, in particular sparse 
matrix operations and unstructured grid algorithms, 

— developing an implementation study, and 

— re-addressing a number of research issues such as 

• higher-level synchronization mechanisms, 

• the management of thread groups and related collective operations, 

• support for recovery mechanisms in view of failing nodes or exceptional 
software conditions, and 

• the I/O interface. 

In addition, we will study the interface of the model to high-level languages 
and related compiler and runtime support issues.

Abstract. For the past two decades, developments in DRAM technology,
the primary technology for the main memory of computers, have been direct- 
ed towards increasing density. As a result 256 M-bit memory chips are now 
commonplace, and we can expect to see systems shipping in volume with 1 
G-bit memory chips within the next two years. Although densities of 
DRAMs have quadrupled every 3 years, access speed has improved much 
less dramatically. This is in contrast to developments in processor technology 
where speeds have doubled nearly every two years. The resulting “memory
gap” has been widely commented on. The solution to this gap until recently
has been to use caches. In the past several years, DRAM manufacturers have 
explored new DRAM structures that could help reduce this gap, and reduce 
the reliance on complex multilevel caches. The new structures have not 
changed the basic storage array that forms the core of a DRAM; the key 
changes are in the interfaces. This paper presents an overview of these new 
DRAM structures. 

1 Introduction 

For the past two decades developments in DRAM technology, the primary technology 
for the main memory of computers, have been directed towards increasing density. As 
a result 256 M-bit memory chips are now commonplace, and we can expect to see sys- 
tems shipping in volume with 1 G-bit memory chips within the next two years. Many 
of the volume applications for DRAM, particularly low cost PCs, will thus require only 
one or two chips for their primary memory. Ironically, then, the technical success of the 
commodity DRAM manufacturers is likely to reduce their future profits, because the 
fall in units shipped is unlikely to be made up by unit price increases. 

Although densities of DRAMs have quadrupled every 3 years, access speed has im- 
proved much less dramatically. This is in contrast to developments in processor tech- 
nology where speeds have doubled nearly every two years. The resulting “memory gap” 
has been widely commented on. The solution to this gap until recently has been to use
caches. In the past several years, DRAM manufacturers have explored new DRAM
structures that could help reduce this gap, and reduce the reliance on complex multilevel 
caches. These structures are primarily new interfaces that allow for much higher band- 
widths to and from chips. In some cases the new structures provide lower latencies too. 
These changes may also solve the problem of diminishing profits by allowing DRAM
companies to charge a premium for the new parts. This paper gives an overview of some
of the more common new DRAM interfaces.

2 DRAM Architectures — Background 

DRAMs are currently the primary memory device of choice because they provide the 
lowest cost per bit and greatest density among solid-state memory technologies. When 
the 8086 was introduced, DRAMs were roughly matched in cycle time with
microprocessors. However, as we noted, since that time processor speeds have
improved at a rate of 80% annually, while DRAM speeds have improved at a rate
of only 7% annually [1]. In order
to reduce the performance impact of this gap, multiple levels of caches have been add- 
ed, and processors have been designed to prefetch and tolerate latency. 
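
To see how quickly these two growth rates diverge, the following sketch compounds the 80% and 7% annual improvement figures quoted above. It is only illustrative: the function name, the assumption that both rates stay constant, and the chosen time horizons are not part of the cited data.

```python
# Illustrative only: compound the annual improvement rates quoted in the text.
PROCESSOR_RATE = 0.80  # 80% per year (from the text)
DRAM_RATE = 0.07       # 7% per year (from the text)


def relative_gap(years: int) -> float:
    """Ratio of processor speedup to DRAM speedup after `years` years,
    assuming both start matched (ratio 1.0) and the rates stay constant."""
    processor = (1.0 + PROCESSOR_RATE) ** years
    dram = (1.0 + DRAM_RATE) ** years
    return processor / dram


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"after {years:2d} years the gap is about {relative_gap(years):.1f}x")
```

Under these assumptions the gap widens by a factor of roughly 1.68 every year, which is why caching and latency tolerance became necessary so quickly.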

The traditional asynchronous DRAM interface underwent some limited changes in 
response to this growing gap. Examples were fast-page-mode (FPM),
extended-data-out (EDO), and burst-EDO (BEDO), each of which provided faster
cycle times when accesses were to the same row, and thus more bandwidth than
its predecessor.

In recent years more dramatic changes to the interface have been made. These built
on the idea, first exploited in fast-page-mode devices, that each read to a DRAM
actually reads a complete row of bits, or word line, from the DRAM core into an array
of sense amps. The traditional asynchronous DRAM interface would then select through
a multiplexer a small number of these bits (x1, x4, x8 being typical). This is clearly
wasteful, and by clocking the interface it is possible to serially read out the entire
row or parts of the row. This describes the now popular synchronous DRAM (SDRAM). The
basic two-dimensional array of bit cells that forms the core of every DRAM is
unchanged, although its density can continue to undergo improvement. The clock runs
much faster than the access time, and thus the bandwidth of memory accesses is greatly
increased provided they are to the same word line.
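
The benefit of streaming from an open row can be sketched with a simple timing model. The numbers below (a 50 ns row access, a 10 ns interface clock and a 64-bit data bus) are assumed purely for illustration and do not describe any particular device.

```python
# Hypothetical timing model: one row access followed by a burst of
# back-to-back transfers from the sense amps, one transfer per clock.
ROW_ACCESS_NS = 50.0     # assumed core access time
CLOCK_PERIOD_NS = 10.0   # assumed interface clock period (100 MHz)
BYTES_PER_TRANSFER = 8   # assumed 64-bit data bus


def effective_bandwidth(burst_length: int) -> float:
    """Bytes per nanosecond (numerically equal to GB/s) for a burst of
    `burst_length` transfers that all hit the same open row."""
    if burst_length < 1:
        raise ValueError("burst_length must be at least 1")
    total_time_ns = ROW_ACCESS_NS + burst_length * CLOCK_PERIOD_NS
    return burst_length * BYTES_PER_TRANSFER / total_time_ns
```

In this model a single transfer pays the full row access time, while long bursts amortize it and approach the bus limit of one transfer per clock.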

There are now a number of improvements and variants on the basic single-data-rate
(SDR) SDRAM. These include double-data-rate (DDR) SDRAM, direct Rambus
DRAM (DRDRAM), and DDR2, which is under development by a number of
manufacturers. DDR signalling is simply a clocking enhancement where data is driven
and received on both the rising and the falling edge of the clock. DRDRAM includes
high speed DDR signalling over a relatively narrow bus to reduce pin count and a high
level of banking on each chip. The core array itself can be subdivided at the expense
of some area. This multibanking allows several outstanding requests to be in flight at
the same time, providing an opportunity to increase bandwidth through pipelining of
the unique bank requests. Additional core enhancements, which may be applied to many
of these new interface specifications, include Virtual Channel (VC) caching, Enhanced
Memory System (EMS) caching and Fast-Cycle (FC) core pipelining. These additional
improvements provide some form of caching of recent sense amp data. Even with these
significant redesigns, the cycle time, as measured by end-to-end access latency, has
continued to improve at a rate significantly lower than microprocessor performance.
These redesigns have been successful at improving bandwidth, but latency continues to
be constrained by the area impact and cost pressures on DRAM core architectures [2].

In the next section we will give an introduction to some of the most popular new 
DRAMs. 

3 The New DRAM Interfaces 

The following sections will briefly discuss a variety of DRAM architectures. The
first four are shown in Table 1. The min/max access latencies given in this table assume
that the access being scheduled has no bus utilization conflicts with other accesses.

Table 1: Synchronous DRAM Interface Characteristics

                       PC100          DDR266 (PC2100)   DDR2             DRDRAM
Potential Bandwidth    0.8 GB/s       2.133 GB/s        3.2 GB/s         1.6 GB/s
Interface Signals      64(72) data,   64(72) data,      64(72) data,     16(18) data,
                       168 pins       168 pins          184 pins         184 pins
Interface Frequency    100 MHz        133 MHz           200 MHz          400 MHz
Latency Range          30-90 ns       18.8-64 ns        17.5-42.6 ns     35-80 ns
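
The bandwidth row of Table 1 can be reproduced from the other rows. The sketch below multiplies the data width by the clock frequency and by the number of transfers per clock. The widths and frequencies come from the table, while the transfers-per-clock values are inferred from the single- versus double-data-rate descriptions in the text. The dictionary and function names are illustrative.

```python
# Peak bandwidth = (data width in bytes) x (clock frequency) x (transfers per clock).
INTERFACES = {
    # name: (data bits, clock in MHz, transfers per clock)
    "PC100": (64, 100.0, 1),
    "DDR266": (64, 133.33, 2),
    "DDR2": (64, 200.0, 2),
    "DRDRAM": (16, 400.0, 2),
}


def peak_bandwidth_gb_s(data_bits: int, clock_mhz: float, transfers_per_clock: int) -> float:
    """Peak bandwidth in GB/s (10**9 bytes per second)."""
    bytes_per_transfer = data_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


if __name__ == "__main__":
    for name, params in INTERFACES.items():
        print(f"{name:7s} {peak_bandwidth_gb_s(*params):.3f} GB/s")
```

Note that the narrow 16-bit DRDRAM channel reaches its bandwidth through clock rate rather than width, which is the pin-count trade-off mentioned earlier.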

The last two sections (ESDRAM and FCDRAM) discuss core enhancements which
can be applied to a core that can then be mated to almost any interface, and the last
section, covering graphics DRAM, reflects the special considerations that this
application requires of DRAM.

3.1 SDR SDRAM (PC100)

The majority of desktop PCs shipped in 1Q2000 use SDR PC100 SDRAM. SDR
DRAM devices are currently available at 133 MHz (PC133). The frequency endpoint
for this line of SDRAMs is in question, though PC150 and PC166 are almost certain to
be developed. As we noted, SDRAM is a synchronous adaptation of the prior
asynchronous FPM and EDO DRAM architectures that streams or bursts data out
synchronously with a clock, provided the data is all from the same row of the core. The
length of the burst is programmable up to the maximum size of the row. The clock (133
MHz in this case) typically runs nearly an order of magnitude faster than the access time
of the core. As such, SDRAM is the first DRAM architecture with support for access
concurrency on a single shared bus. Earlier non-synchronous DRAM had to support
access concurrency via externally controlled interleaving.

3.2 DDR SDRAM (DDR266)

Earlier we noted that DDR SDRAM differs from SDR in that unique data is driven and 
sampled at both the rising and falling edges of the clock signal. This effectively doubles 
the data bandwidth of the bus compared to an SDR SDRAM running at the same clock 
frequency. DDR266 devices are very similar to SDR SDRAM in all other characteris- 
tics. They use the same signalling technology, the same interface specification, and the 
same pinouts on the DIMM carriers. The JEDEC specification for DDR266 devices 
provides for a number of “CAS-latency” speed grades. Chipsets are currently under de- 
velopment for DDR266 SDRAM and are expected to reach the market in 4Q2000. 

3.3 DDR2 

The DDR2 specification under development by the JEDEC 42.3 Future DRAM Task
Group is intended to be the follow-on device specification to DDR SDRAM. While
DDR2 will have a new pin interface and signalling method (SSTL), it will leverage
much of the existing engineering behind current SDRAM. The initial speed for DDR2
parts will be 200 MHz in a bussed environment, and 400 MHz in a point-to-point
application, with data transitioning on both edges of the clock [3]. Beyond simply
advancing the clock speed, DDR2 has a number of interface changes intended to enable
faster clock speeds or higher bus utilization. The lower interface voltage (1.8 V),
differential clocking and micro-BGA packaging are all intended to support a higher
clock rate on the bus. Specifying a write latency equal to the read latency minus one
(WL = RL - 1) provides a time profile for both read and write transactions that enables
easier pipelining of the two transaction types, and thus higher bus utilization.
Similarly, the addition of a programmable additive latency (AL) postpones the
transmission of a CAS from the interface to the core. This is typically referred to as
a “posted-CAS” transaction. This enables the RAS and CAS of a transaction to be
transmitted by the controller on adjacent bus cycles. Non-zero usage of the AL
parameter is best paired with a closed-page-autoprecharge controller policy, because
otherwise open-page hits incur an unnecessary latency penalty. The burst length on
DDR2 has been fixed at 4 data bus cycles. This is seen as a method to simplify the
driver/receiver logic, at the expense of heavier loading on the address signals of the
DRAM bus [4]. The DDR2 specification is not finalized, but the information contained
here is based upon the most recent drafts for DDR2 devices and conversations with
JEDEC members.
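
The latency relationships described above can be written down as a small sketch. The relation WL = RL - 1 and the idea of issuing RAS and CAS on adjacent cycles come from the text. Treating the read latency as the sum of the additive latency and the CAS latency (RL = AL + CL) is an assumption made here for illustration, as are the function names and any cycle counts passed in.

```python
def read_latency(additive_latency: int, cas_latency: int) -> int:
    """Read latency in clock cycles, ASSUMING RL = AL + CL."""
    return additive_latency + cas_latency


def write_latency(additive_latency: int, cas_latency: int) -> int:
    """Write latency in clock cycles, using WL = RL - 1 from the text."""
    return read_latency(additive_latency, cas_latency) - 1


def posted_cas_schedule(ras_cycle: int, additive_latency: int) -> dict:
    """Cycles at which the controller issues RAS and CAS, and the cycle at
    which the device forwards the posted CAS from its interface to the core."""
    cas_issued = ras_cycle + 1  # CAS is sent on the bus cycle right after RAS
    return {
        "ras_issued": ras_cycle,
        "cas_issued": cas_issued,
        "cas_reaches_core": cas_issued + additive_latency,
    }
```

With a non-zero AL the command bus is freed one cycle after the RAS, at the cost of delaying every column access by AL cycles, which is the penalty for open-page hits noted above.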

3.4 Direct Rambus (DRDRAM) 

Direct Rambus DRAM (DRDRAM) devices use a 400 MHz 3-byte-wide channel (2 bytes for
data, 1 for addresses/commands). DRDRAM devices use DDR signalling, implying a
maximum bandwidth of 1.6 G-bytes/s, and these devices have many banks in relation
to SDRAM devices of the same size. Each sense-amp, and thus row buffer, is shared
between adjacent banks. This implies that adjacent banks cannot simultaneously
maintain an open page, nor can a bank maintain an open page while a neighboring bank
performs an access. The increased number of banks for a fixed address space
increases the ability to pipeline accesses, because it reduces the probability of
sequential accesses mapping into the same bank. The sharing of sense-amps increases
the row-buffer miss rate as compared to having one open row per bank, but it reduces
the cost by reducing the die area occupied by the row buffer [5].
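
The effect of shared sense-amps on row-buffer hits can be illustrated with a toy model. The class below is a hypothetical sketch, not a description of the actual DRDRAM control logic: it only encodes the rule that opening a row in one bank closes any open row in the two adjacent banks.

```python
class SharedSenseAmpBanks:
    """Toy model: bank i shares sense amps with banks i-1 and i+1, so
    opening a row in bank i forces any open neighbouring bank to close."""

    def __init__(self, num_banks: int) -> None:
        self.num_banks = num_banks
        self.open_rows = {}  # maps bank index -> currently open row

    def access(self, bank: int, row: int) -> bool:
        """Access (bank, row); return True on a row-buffer hit."""
        if not 0 <= bank < self.num_banks:
            raise ValueError("bank index out of range")
        if self.open_rows.get(bank) == row:
            return True
        # Miss: neighbours lose their open rows because the sense amps are shared.
        for neighbour in (bank - 1, bank + 1):
            self.open_rows.pop(neighbour, None)
        self.open_rows[bank] = row
        return False
```

Alternating accesses between two adjacent banks therefore miss every time in this model, whereas banks that are not neighbours can keep their rows open independently.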

3.5 ESDRAM 

A number of proposals have been made for adding a small amount of SRAM cache onto 
the DRAM device. Perhaps the most straightforward approach is advocated by En- 
hanced Memory Systems (EMS). The caching structure proposed by EMS is a single 
direct-mapped SRAM cache line, the same size as the DRAM row, associated with each 
bank. This allows the device to service accesses to the most recently accessed row, re- 
gardless of whether refresh has occurred and enables the precharge of the DRAM array 
to be done in the background without affecting the contents of this row-cache. This ar- 
chitecture also supports a no-write-transfer mode within a series of interspersed read 
and write accesses. The no-write-transfer mode allows writes to occur through the 
sense-amps, without affecting the data currently being held in the cache-line associated 
with that bank [6]. This approach may be applied to any DRAM interface;
PC100-interface parts are currently available and DDR2 parts have been proposed.

3.6 FCDRAM 

Fast Cycle DRAM (FCRAM) developed by Fujitsu is an enhancement to SDRAM 
which allows for faster repetitive access to a single bank. This is accomplished by di- 
viding the array not only into multiple banks but also small blocks within a bank. This 
decreases each block’s access time due to reduced capacitance, and enables pipelining 
of requests to the same bank. Multistage pipelining of the core array hides precharge, 
allowing it to occur simultaneously with input-signal latching and data transfer to the 
output latch. FCDRAM is currently sampling in 64M-bit quantities, utilizing the
JEDEC standard DDR SDRAM interface, but is hampered by a significant price
premium based upon the die area overhead of this technique [7]. Fujitsu is currently
sampling FCDRAM devices which utilize both SDR and DDR SDRAM interfaces;
additionally, low-power devices targeted at the notebook design space are available.

4 Conclusions 

DRAMs are widely used because they provide a highly cost-effective storage solution.
While there are a number of proposals for technology to replace DRAM, such as 
SRAM, magnetic RAM (MRAM) [8] or optical storage [9], the new DRAM technolo- 
gies remain the volatile memory of choice for the foreseeable future. Increasing the per- 
formance of DRAM by employing onboard cache, and interfaces with higher utilization 
or smaller banks may impact the cost of the devices, but it could also significantly in- 
crease the performance which system designers are able to extract from the primary 
memory system. 

The DRAM industry has been very conservative about changing the structure of 
DRAMs in even the most minor fashion. There has been a dramatic change in this atti- 
tude in the past few years and we are now seeing a wide variety of new organizations 
being offered. Whether one will prevail and create a new standard commodity part
remains to be seen.

Abstract. In December 1999, IBM Research announced a five-year,
$100M research project, code named Blue Gene, to build a petaflop
computer which will be used primarily for research in computational
biology. This computer will be 100 times faster than the current fastest
supercomputer, ASCI White. The Blue Gene project has the potential
to revolutionize research in high-performance computing and in
computational biology.

To reach a petaflop, Blue Gene interconnects approximately one million
identical and simple processors, each capable of executing at a rate of
one gigaflop. Each of the 25 processors on a single chip contains a half
megabyte of embedded DRAM, which is shared among those processors. 
They communicate through a system of high speed orthogonal opposing 
rings. The approximately 40,000 chips communicate by message passing. 
The configuration is suitable for highly parallel problems that do not 
require huge amounts of memory. Such problems can be found in com- 
putational biology, high-end visualization, computational fluid dynamics, 
and other areas. 
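
The headline figures in this abstract are mutually consistent, as a quick calculation shows. The sketch uses only the approximate numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the abstract (all approximate).
PROCESSORS_PER_CHIP = 25
CHIPS = 40_000
FLOPS_PER_PROCESSOR = 1e9  # one gigaflop

total_processors = PROCESSORS_PER_CHIP * CHIPS
peak_flops = total_processors * FLOPS_PER_PROCESSOR

print(f"processors: {total_processors:,}")
print(f"peak: {peak_flops / 1e15:.1f} petaflops")
```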

This talk will be primarily about the Blue Gene hardware and system 
software. We will also briefly discuss the protein folding application.

Abstract. The Science and Technology Agency of Japan has proposed
a project to promote studies for global change prediction by an
integrated three-in-one research and development approach: earth
observation, basic research, and computer simulation. As part of the
project, we are developing an ultra-fast computer, the "Earth
Simulator", with a sustained speed of more than 5 TFLOPS for an
atmospheric circulation code. The "Earth Simulator" is a MIMD-type
distributed memory parallel system in which 640 processor nodes are
connected via a fast single-stage crossbar network. Each node consists of
8 vector-type arithmetic processors which are tightly connected via
shared memory. The peak performance of the total system is 40
TFLOPS. As part of the development of the basic software system, we are
developing an operation support software system called a
“center routine”. The total system will be completed in the spring
of 2002.



1 Introduction 

The Science and Technology Agency of Japan has proposed a project to promote 
studies for global change prediction by an integrated three-in-one research and 
development approach: earth observation, basic research, and computer simulation. It 
goes without saying that basic process and observation studies for global change are 
very important. Most of these basic processes, however, are tightly coupled and form 
a typical complex system. A large-scale simulation in which the coupling between
these basic processes is taken into consideration is the only way to reach a complete
understanding of this kind of complicated phenomenon. As part of the project, we are
developing an ultra-fast computer named the "Earth Simulator".

The Earth Simulator has two important targets: one is applications in
atmospheric and oceanographic science, and the other is applications in solid
earth science. For the first, high resolution global, regional and local
models will be developed; for the second, a global dynamic model describing the
entire solid earth as a system, a simulation model of the earthquake generation
process, etc., will be developed.



2 Outline of the Hardware System 

Taking a global AGCM (atmospheric general circulation model) as an example,
we consider here the computational resources required of the Earth
Simulator. A typical present-day global AGCM uses a mesh of about 100 km in both
the longitudinal and latitudinal directions. The mesh size will be reduced to 10 km in
the high resolution global AGCM on the Earth Simulator. The number of layers will also
be increased to several to 10 times that of the present model. In accordance with the
resolution level, the time integration mesh (the time step) must also be reduced.
Taking all these conditions into account, both the CPU performance and the main memory
of the Earth Simulator must be at least 1000 times larger than those of present
computers. The effective performance of typical present computers is about 4-6 GFLOPS.
Therefore, we set the sustained performance of the Earth Simulator for a high
resolution global AGCM to be more than 5 TFLOPS.
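
The reasoning behind the 5 TFLOPS target can be followed with a short calculation. The sketch below takes the mesh sizes, the overall factor of 1000 and the 4-6 GFLOPS range directly from the text. It computes only the horizontal refinement explicitly, since the text gives the layer and time-step contributions only qualitatively. The variable names are illustrative.

```python
# Present-day figures quoted in the text.
PRESENT_MESH_KM = 100.0
TARGET_MESH_KM = 10.0
PRESENT_SUSTAINED_GFLOPS = (4.0, 6.0)
REQUIRED_SCALE_FACTOR = 1000  # "at least 1000 times larger" (from the text)

# Refining the mesh in both horizontal directions multiplies the number of
# grid columns by the square of the refinement ratio.
horizontal_factor = (PRESENT_MESH_KM / TARGET_MESH_KM) ** 2

# Scaling today's sustained performance by the overall factor gives the target range.
target_tflops = tuple(g * REQUIRED_SCALE_FACTOR / 1000.0 for g in PRESENT_SUSTAINED_GFLOPS)

print(f"horizontal grid columns increase by a factor of {horizontal_factor:.0f}")
print(f"sustained target range: {target_tflops[0]:.0f}-{target_tflops[1]:.0f} TFLOPS")
```

The horizontal refinement alone accounts for a factor of 100; the additional layers and the shorter time step supply the rest of the stated factor.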

Reviewing the trends of commercial parallel computers, we can consider two types 
of parallel architectures for the Earth Simulator; one is a distributed parallel system 
with cache-based microprocessors and the other is a system with vector processors. 
According to the performance evaluation for a well-known AGCM (CCM2), it is 
shown that the efficiency is less than 7% on cache-based parallel systems, where the 
efficiency is the ratio of the sustained performance to the theoretical peak. On the 
other hand, an efficiency of about 30% was obtained on parallel systems with vector
processors [1]. For this reason, we decided to employ a distributed parallel system 
with vector processors. 

Another key issue for a parallel system is the interconnection network. As 
mentioned above, many different types of applications will run on the Earth 
Simulator. Judging from the flexibility of parallelism for many different types of 
applications, we employ a single-stage crossbar network in order to make the system 
completely flat.

[Figure 1: processor nodes (up to PN639), each containing arithmetic processors
AP0 to AP7, connected by the single-stage crossbar network. Legend:
  Fast switch: 16 GB/s
  PN: Processor Node (multi-processor node)
  AP: Arithmetic Processor, 8 GFLOPS (peak)
  MM: Main Memory (shared), 16 GB
  RCU: Remote Control Unit
  MNU: Memory Network Unit]

Fig. 1 Hardware system of the Earth Simulator

An outline of the hardware system of the Earth Simulator, which is shown in Fig. 1,
can be summarized as follows:

• Architecture: MIMD-type distributed memory parallel system consisting of
  computing nodes with shared memory vector-type multi-processors.

• Performance: Assuming an efficiency of 12.5%, the peak performance is 40
  TFLOPS (the effective performance for an AGCM is more than 5 TFLOPS).

    Total number of processor nodes     640
    Number of PEs for each node         8
    Total number of PEs                 5120
    Peak performance of each PE         8 GFLOPS
    Peak performance of each node       64 GFLOPS

• Main memory: 10 TB (total).

    Shared memory / node                16 GB

• Interconnection network: Single-Stage Crossbar Network

3 Outline of the Basic Software

It is anticipated that operating the huge system described above could be very
difficult. To ease this difficulty, the 640 nodes are divided into 40 groups, which
are called “clusters”. Each cluster consists of 16 nodes, a CCS (Cluster Control
Station), an IOCS (I/O Control Station) and system disks, as shown in Fig. 2.
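
The totals quoted in the hardware outline and the cluster layout just described can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the text; the variable names and the convention 1 TB = 1024 GB are assumptions.

```python
NODES = 640
PES_PER_NODE = 8
PEAK_GFLOPS_PER_PE = 8
MEMORY_GB_PER_NODE = 16
CLUSTERS = 40
ASSUMED_EFFICIENCY = 0.125  # 12.5%, as stated in the hardware outline

total_pes = NODES * PES_PER_NODE                          # 5120
peak_gflops_per_node = PES_PER_NODE * PEAK_GFLOPS_PER_PE  # 64
peak_tflops = NODES * peak_gflops_per_node / 1000         # 40.96, quoted as 40
total_memory_tb = NODES * MEMORY_GB_PER_NODE / 1024       # 10.0
sustained_tflops = ASSUMED_EFFICIENCY * peak_tflops       # 5.12
nodes_per_cluster = NODES // CLUSTERS                     # 16
```

The computed sustained figure of about 5.1 TFLOPS is consistent with the stated goal of more than 5 TFLOPS for an AGCM.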

In the Earth Simulator system, we employ parallel I/O techniques for both disk and
mass-storage systems to enhance data throughput. That is, each node has an I/O
processor to access a disk system via a fast network, and each cluster is connected to
a drive of the mass-storage system via the IOCS. Each cluster is controlled by the CCS
and the total system is controlled by the SCCS (Super Cluster Control Station). The
Earth Simulator is basically a batch-job system and most of the clusters are called
batch-job clusters. However, we are planning to prepare a very special cluster, a TSS
cluster. In the TSS cluster, one or two nodes will be used for TSS jobs and the rest of
the nodes for small scale (single node) batch jobs. Note that user disks are connected
only to the TSS cluster to save the budget for peripheral devices. Therefore, most of
the user files which will be used on the batch-job clusters are to be stored in the mass
storage system.



[Figure 2 labels: LAN Switch (46 ports), ATM, LE/FC; Mass Storage System (CTL). Legend:
  WS: Work Station
  FC: Fiber Channel
  LE: Link Encapsulation
  CTL: Cartridge Tape Library System
  CCS: Cluster Control Station
  SCCS: Super Cluster Control Station
  IOCS: I/O Control Station]

Fig. 2 Clusters and peripheral devices of the Earth Simulator

We basically use an operating system which will be developed by the vendor
(NEC) for their future system. However, some high-level functions characteristic of
the Earth Simulator will be newly developed, such as:

- large scalability of distributed-memory parallel processing,
- total system control by SCCS,
- large scale parallel I/O,
- interface for center routine (operation support software), etc.

Concerning parallel programming languages, automatic parallelization, OpenMP
and micro-tasking will be prepared as standard languages for shared-memory
parallelization, and both MPI2 and HPF2 for distributed memory parallelization. As
described above, the Earth Simulator has a memory hierarchy. A user who wants to
optimize code on the system must take the memory hierarchy into
consideration, as shown in Fig. 3.



[Figure 3 content, reconstructed from the labels that survive:
  Memory hierarchy: total system; node.
  Program optimization (DO loop): automatic parallelization, OpenMP; vectorization.
  Parallel programming:
    (1) Shared memory parallelization: automatic parallelization, OpenMP
    (2) Distributed memory parallelization: MPI2 (Message Passing Interface V2),
        HPF2 (High Performance Fortran V2)
  HPF for the Earth Simulator: JAHPF plus extensions.
  Extended JAHPF: communication pattern reuse, asynchronous communication,
    user controllable shadow, reduction kinds, parallelization of loops
    including procedure calls, etc.
  Approved extensions: indirect distribution, asynchronous I/O.
  JAHPF: Japan Association for HPF]

Fig. 3 Memory hierarchy and program optimization

As part of the development of the basic software system, we are developing an
operation support software system called a “center routine”. We are going
to use an archival system as the main storage for user files. It usually takes a very
long time to transfer user files from the mass-storage system to the system disks.
Therefore, the most important function of the center routine is the optimal scheduling
of not only submitted batch jobs but also the user files necessary for the jobs. Because
the Earth Simulator is a special machine for numerical Earth science, our job
scheduler has the following important features:

- higher priority to job turn-round rather than total job throughput,

- prospective reservation of batch-job nodes. Jobs are controlled by wall-clock
  time.

- automatic recall and migration of user files between system disks and
  mass-storage system.
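
One way to picture the interplay between node reservation and file recall is the sketch below. It is not the center routine itself: the `Job` fields, the function names and the constant transfer-rate model are all hypothetical, chosen only to show that a file recall must start early enough to finish before a job's reserved start time.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    reserved_start_s: float  # wall-clock time at which the nodes are reserved
    input_bytes: float       # user files to recall from mass storage


def latest_recall_start(job: Job, transfer_rate_bytes_per_s: float) -> float:
    """Latest time at which the recall of the job's files can begin so that
    staging to the system disks completes by the reserved start time."""
    if transfer_rate_bytes_per_s <= 0:
        raise ValueError("transfer rate must be positive")
    staging_time_s = job.input_bytes / transfer_rate_bytes_per_s
    return job.reserved_start_s - staging_time_s


def recall_order(jobs: list, transfer_rate_bytes_per_s: float) -> list:
    """Job names sorted by how soon their file recall has to begin."""
    ordered = sorted(
        jobs, key=lambda job: latest_recall_start(job, transfer_rate_bytes_per_s)
    )
    return [job.name for job in ordered]
```

A job with a later reservation but much larger input files may need its recall started first, which is why jobs and files have to be scheduled together.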




38 Keiji Tani 



4 Outline of the Application Software 

We are also developing application programs optimized for the architecture of the
Earth Simulator. A typical AGCM can be divided into two parts, a dynamical core
and a set of physical models. We employ a standard plug-in interface to combine the
physical models with the dynamical core. With this system, users can easily replace
any old physical model with a new one and compare the results. The basic concept
of the plug-in interface is shown in Fig. 4.
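
The plug-in idea can be illustrated with a minimal sketch. Nothing below is taken from the actual Earth Simulator software: the `DynamicalCore` class, the representation of the model state as a dictionary, and the convention that a physics plug-in returns tendencies are all assumptions made for illustration.

```python
from typing import Callable, Dict

# A "physics plug-in" is modelled as a function that takes the model state
# and a time step and returns tendencies for some of the state variables.
State = Dict[str, float]
PhysicsPlugin = Callable[[State, float], State]


class DynamicalCore:
    """Toy stand-in for a dynamical core that accepts interchangeable physics."""

    def __init__(self) -> None:
        self.plugins: Dict[str, PhysicsPlugin] = {}

    def register(self, name: str, plugin: PhysicsPlugin) -> None:
        """Install or replace the physics model registered under `name`."""
        self.plugins[name] = plugin

    def step(self, state: State, dt: float) -> State:
        """Advance the state by one step by adding every plug-in's tendency."""
        new_state = dict(state)
        for plugin in self.plugins.values():
            for key, tendency in plugin(state, dt).items():
                new_state[key] = new_state.get(key, 0.0) + tendency * dt
        return new_state
```

Registering a second function under the same name replaces the first, so two versions of a physical model can be compared without touching the core.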




[Figure 4 labels: Platform #n; Dynamics (parallelization, numerical method, coordinates, boundary conditions)]



Fig. 4 Concept of plug-in interface 

Building on the idea of the plug-in interface, we are developing two application
software systems which will be opened to, and used by, many users as standard models
on the Earth Simulator system: an atmosphere-ocean coupled model and a large scale
FEM model for solid Earth science called “GeoFEM”. The configuration of
the atmosphere-ocean coupled model is shown in Fig. 5 (a). Preliminary results on
the 1-day averaged global distribution of precipitation are also shown in Fig. 5 (b).
The results have been obtained by averaging over the results at the 4th and 5th years
with an initial condition of static atmosphere. Also, the configuration of GeoFEM and
preliminary results on the crust/mantle activity in the Japan Archipelago region are
shown in Fig. 6 (a) and (b), respectively [2].

[Figure 5 (a) content, reconstructed from the labels that survive:

  Atmospheric GCM
    Dynamics (CCSR AGCM 5.4)
      - Governing equation: 3D primitive equation
      - Coordinate system: spherical, σ-coordinate
      - Discretization: spectral in horizontal, Arakawa & Suarez in vertical
      - Time integration: semi-implicit leap-frog
    Physics (CCSR AGCM 5.4), attached via plug-in
      - Cumulus convection: Arakawa & Schubert; cloud model: entrainment-plume type
      - Large scale condensation: Le Treut & Li
      - Radiation: DOM adding, K-distribution method
      - Vertical diffusion: turbulent closure level 2

  Ocean GCM
    Dynamics (POM)
      - Governing equation: 3D primitive equation
      - Coordinate system: orthogonal curvilinear, σ-coordinate
      - Sea surface: free surface
      - Discretization: Arakawa C-grid
      - Time integration: split-explicit time step, leap-frog
    Physics, attached via plug-in
      - Boundary conditions: sea surface (wind stress, heat flux, evaporation);
        sea bottom (friction)
      - Heat flux: sensible heat, latent heat, solar radiation
      - Horizontal diffusion: Smagorinsky
      - Vertical diffusion: turbulent closure level 2]

Fig. 5 Configuration of an Earth Simulator standard atmosphere and ocean GCM (a) and
preliminary results on 1-day averaged global distribution of precipitation (b)

Fig. 6 Configuration of GeoFEM (a) and preliminary results on the
crust/mantle activity in the Japan Archipelago region (b)



5 Schedule 

All the design and R&D work for both the hardware and basic software systems (the
conceptual design, basic design, the design of parts and packaging, the R&D for parts
and packaging, and the detailed design) was completed during the last three
fiscal years, FY97, 98 and 99. The manufacture of the hardware system is now
underway. The development of the center routine is also underway. Facilities
necessary for the Earth Simulator, including buildings, are also under construction.
A bird’s-eye view of the Earth Simulator building is shown in Fig. 7. The total
system will be completed in the spring of 2002.

An outline of the schedule is shown in Table 1. 




Fig. 7 Bird’s-eye view of the Earth Simulator building 



Table 1 Outline of schedule

Abstract. Traditional parallel compilers do not effectively parallelize
irregular applications because they contain little loop-level parallelism. We
explore Speculative Task Parallelism (STP), where tasks are full proce- 
dures and entire natural loops. Through profiling and compiler analysis, 
we find tasks that are speculatively memory- and control-independent 
of their neighboring code. Via speculative futures, these tasks may be 
executed in parallel with preceding code when there is a high probability 
of independence. We estimate the amount of STP in irregular appli- 
cations by measuring the number of memory-independent instructions 
these tasks expose. We find that 7 to 22% of dynamic instructions are 
within memory-independent tasks, depending on assumptions. 



1 Introduction 

Today’s microprocessors rely heavily on instruction-level parallelism (ILP) to 
gain higher performance. Flow control imposes a limit to available ILP in
single-threaded applications [8]. One way to overcome this limit is to find parallel tasks
and employ multiple flows of control (threads). Task-level parallelism (TLP) 
arises when a task is independent of its neighboring code. We focus on finding 
these independent tasks and exploring the resulting performance gains. 

Traditional parallel compilers exploit one variety of TLP, loop level
parallelism (LLP), where loop iterations are executed in parallel. LLP is
overwhelmingly found in numeric, typically FORTRAN, programs with regular patterns of
data accesses. In contrast, general purpose integer applications, which account 
for the majority of codes currently run on microprocessors, exhibit little LLP as 
they tend to access data in irregular patterns through pointers. Without pointer 
disambiguation to analyze data access dependences, traditional parallel compil- 
ers cannot parallelize these irregular applications and ensure correct execution. 

In this paper we explore task-level parallelism in irregular applications by 
focusing on Speculative Task Parallelism (STP), where tasks are speculatively 
executed in parallel under the following assumptions: 1) tasks are full proce- 
dures or entire natural loops, 2) tasks are speculatively memory-independent 
and control-independent, and 3) our architecture allows the parallelization of 
tasks via speculative futures (discussed below). Figure 1 illustrates STP, show- 
ing a dynamic instruction stream where a task Y has no memory access conflicts 
Fig. 1. STP example: (a) shows a section of code where the task Y is known to be
memory-independent of the preceding code X. (b) the shaded region shows memory- 
and control-independent instructions that are essentially removed from the critical path 
when Y is executed in parallel with X. (c) when task Y is longer than X. 



with a group of instructions, X, that precede Y. The shorter of X and Y deter- 
mines the overlap of memory-independent instructions as seen in Figures 1(b) 
and 1(c). In the absence of any register dependences, X and Y may be executed 
in parallel, resulting in shorter execution time. It is hard for traditional parallel 
compilers of pointer-based languages to expose this parallelism. 

The goals of this paper are to identify such regions within irregular applica- 
tions and to find the number of instructions that may thus be removed from the 
critical path. This number represents the maximum possible STP. To facilitate 
our discussion, we offer the following definitions. 

A task, exemplified by Y in Figure 1, is a bounded set of instructions in- 
herent to the application. Two sections of code are memory-independent when 
neither contains a store to a memory location that the other accesses. When 
all load/store combinations of the type [load, store], [store,load] and [store, store] 
between two tasks, X and Y, access different memory locations, X and Y are said 
to be memory-independent. A launch point is the point in the code preceding a
task where the task may be initiated in parallel with the preceding code. This
point is determined through profiling and compiler analysis. A launched task is
one that begins execution from an associated launch point on a different thread. 
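The definition of memory-independence above can be sketched as a small check over access sets. This is only an illustration: the `AccessSets` type and the example addresses are hypothetical and are not part of the profiling infrastructure described in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class AccessSets:
    """Memory locations that a section of code loads from and stores to."""
    loads: set = field(default_factory=set)
    stores: set = field(default_factory=set)


def memory_independent(x: AccessSets, y: AccessSets) -> bool:
    """True when no [load, store], [store, load] or [store, store] pair
    between x and y touches the same location."""
    x_all = x.loads | x.stores
    y_all = y.loads | y.stores
    # A store in either section must not hit anything the other accesses.
    return not (x.stores & y_all) and not (y.stores & x_all)


# Hypothetical access sets: x and y share only a loaded location.
x = AccessSets(loads={0x100, 0x104}, stores={0x200})
y = AccessSets(loads={0x100}, stores={0x300})
# z loads a location that x stores to, so x and z are not independent.
z = AccessSets(loads={0x200}, stores=set())
```

Note that two loads of the same location, as in `x` and `y`, do not break memory-independence; only pairs involving at least one store do.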

Because the biggest barrier to detecting independence in irregular codes is 
memory disambiguation, we identify memory-independent tasks using a profile- 
based approach and measure the amount of STP by estimating the amount of 
memory-independent instructions those tasks expose. As successive executions 
may differ from the profiled execution, any launched task would be inherently 
speculative. One way of launching a task in parallel with its preceding code is 
through a parallel language construct called a future. A future conceptually forks 
a thread to execute the task and identifies an area of memory in which to relay 
status and results. When the original thread needs the results of the futured 
task, it either waits on the futured task thread or, if the task was never
futured because no thread was idle, executes the futured task itself.
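The behaviour of a (non-speculative) future described above can be sketched with ordinary threads. The `TaskFuture` class and its `idle_thread_available` flag are hypothetical names chosen for this illustration; a real machine would decide thread availability in hardware or in a runtime, and a speculative future would additionally need mis-speculation recovery, which is not modelled here.

```python
import threading


class TaskFuture:
    """Hypothetical future: forks the task onto another thread if one is
    idle, otherwise leaves it for the original thread to run later."""

    def __init__(self, task, idle_thread_available):
        self._task = task
        self._result = None
        self._done = False
        self._thread = None
        if idle_thread_available:
            self._thread = threading.Thread(target=self._run)
            self._thread.start()

    def _run(self):
        self._result = self._task()
        self._done = True

    def was_futured(self):
        return self._thread is not None

    def result(self):
        if self._thread is not None:
            self._thread.join()   # wait on the futured task thread
        elif not self._done:
            self._run()           # never futured: execute the task ourselves
        return self._result
```

At the original call site, the caller simply asks for `result()`; whether the task ran early on another thread or runs serially at that moment is hidden behind the same call.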

To exploit STP, we assume a speculative machine that supports specula- 
tive futures. Such a processor could speculatively execute code in parallel when 
there is a high probability of independence, but no guarantee. Our work iden- 
tifies launch points for this speculative machine, and estimates the parallelism 
available to such a machine. With varying levels of control and memory spec- 
ulation, 7 to 22% of dynamic instructions are within tasks that are found to 
be memory-independent, on a set of irregular applications for which traditional 
methods of parallelization are ineffective. 

In the next section we discuss related work. Section 3 contains a descrip- 
tion of how we identify and quantify STP. Section 4 describes our experiment 
methodology and Section 5 continues with some results. Implementation issues 
are highlighted in Section 6, followed by a summary in Section 7. 



2 Related Work 



In order to exploit Speculative Task Parallelism, a system would minimally need 
to include multiple flows of control and memory disambiguation to aid in mis- 
speculation detection. Current proposed structures that aid in dynamic memory 
disambiguation are implemented in hardware alone [3] or rely upon a compiler [5, 
4]. All minimally allow loads to be speculatively executed above stores and detect
write-after-read violations that may result from such speculation. 

Some multithreaded machines [21,19,2] and single-chip multiprocessors [6, 
7] facilitate multiple flows of control from a single program, where flows are gen-
erated by the compiler and/or dynamically. All of these architectures could exploit
non-speculative TLP if the compiler exposed it, but only Hydra [6] could support 
STP without alteration. 

Our paper examines speculatively parallel tasks in non-traditionally paral- 
lel applications. Other proposed systems, displaying a variety of characteristics, 
also use speculation to increase parallelism. They include Multiscalar proces- 
sors [16, 12,20], Block Structured Architecture [9], Speculative Thread-level Par- 
allelism [15,14], Thread-level Data Speculation [18], Dynamic Multithreading 
Processor [1], and Data Speculative Multithreaded hardware architecture [11, 
10].

In these systems, the types of speculative tasks include fixed-size blocks [9],
one or more basic blocks [16], dynamic instruction sequences [18], loop itera- 
tions [15,11], instructions following a loop [1], or following a procedure call [1, 
14]. These tasks were identified dynamically at run-time [11,1], statically by 
compilers [20,9,14], or by hand [18]. The underlying architectures include tra- 
ditional multiprocessors [15,18], non-traditional multiprocessors [16,9,10], and 
multithreaded processors [1,11]. 

Memory disambiguation and mis-speculation detection were handled by an
Address Resolution Buffer [16], the Time Warp mechanism of time stamping 







Barbara Kreaseck et al. 



requests to memory [9], extended cache coherence schemes [14,18], fully asso- 
ciative queues [1], and iteration tables [11]. Control mis-speculation was always 
handled by squashing the mis-speculated task and any of its dependents. While 
a few handled data mis-speculations by squashing, one rolls back speculative ex- 
ecution to the wrong data speculation [14] and others allow selective, dependent 
re-execution of the wrong data speculation [9, 1]. 

Most systems facilitate data flow by forwarding values produced by one 
thread to any consuming threads [16, 9, 18, 1, 11]. A few avoid data mis-specula- 
tion through synchronization [12, 14]. Some systems enable speculation by value 
prediction using last-value [1, 14, 11] and stride-value predictors [14, 11]. 

STP identifies a source of parallelism that is complementary to that found by
most of the systems above. Armed with a speculative future mechanism, these 
systems may benefit from exploiting STP. 

3 Finding Task-based Parallelism 

We find Speculative Task Parallelism by identifying all tasks that are memory- 
independent of the code that precedes the task. This is done through profiling 
and compiler analysis, collecting data from memory access conflicts and control 
flow information. These conflicts determine proposed launch points that mark 
the memory dependences of a task. Then for each task, we traverse the control 
flow graph (CFG) in reverse control flow order to determine launch points based 
upon memory and control dependences. Finally, we estimate the parallelism 
expected from launching the tasks early. The following paragraphs explain the details of our
approach to finding STP. 

Task Selection The type of task chosen for speculative execution directly af-
fects the amount of speculative parallelism found in an application. Oplinger
et al. [14] found that loop iterations alone were insufficient to make speculative
thread-level parallelism effective for most programs. To find STP, we look at 
three types of tasks: leaf procedures (procedures that do not call any other pro- 
cedure), non-leaf procedures, and entire natural loops. When profiling a combina- 
tion of task types, we profile them concurrently, exposing memory-independent 
instructions within an environment of interacting tasks. 

Although all tasks of the chosen type(s) are profiled, only those that expose 
at least a minimum number of memory-independent instructions are chosen to 
be launched early. The final task selection is made after evaluating memory and 
control dependences to determine actual launch points. 

Memory Access Conflicts Memory access conflicts are used to determine the 
memory dependences of a task. They occur when two load/store instructions 
access the same memory region. Only a subset of memory conflicts that occur 
during execution are useful for calculating launch points. Useful conflicts span 
task boundaries and are of the form [load, store], [store, load], or [store, store]. We
also disregard stores or loads due to register saves and restores across procedure 
Fig. 2. PLP Locations: conflict source, conflict destination, PLP candidate, PLP over-
lap, and task overlap, when the latest conflict source is (a) before the calling routine,
(b) within a sibling task, (c) within the calling routine.



calls. We call the conflicting instruction preceding the task the conflict source, 
and the conflicting instruction within the task is called the conflict destination. 
Specifically, when the conflict destination is a load, the conflict source will be 
the last store to that memory region that occurred outside the task. When the 
conflict destination is a store, the conflict source will be the last load or store to 
that memory region that occurred outside the task. 
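As a minimal sketch of these rules, the function below scans a hypothetical dynamic trace and returns the conflict whose source lies closest to the task. The trace format (a list of `(kind, region)` pairs) and the function name are assumptions made for this example; the sketch also assumes that register saves and restores have already been filtered out of the trace.

```python
def latest_conflict(trace, task_start, task_end):
    """trace: list of (kind, region), kind being "load" or "store".
    The task occupies trace indices task_start .. task_end - 1.
    Returns (source_index, destination_index) for the conflict whose
    source is closest to the task, or None if the task is conflict-free."""
    last_store = {}    # region -> index of last store before the task
    last_access = {}   # region -> index of last load or store before the task
    for i in range(task_start):
        kind, region = trace[i]
        last_access[region] = i
        if kind == "store":
            last_store[region] = i

    best = None
    for j in range(task_start, task_end):
        kind, region = trace[j]
        # A load destination conflicts only with an earlier store;
        # a store destination conflicts with an earlier load or store.
        table = last_store if kind == "load" else last_access
        src = table.get(region)
        if src is not None and (best is None or src > best[0]):
            best = (src, j)
    return best


# Hypothetical trace: three instructions of preceding code, then a task.
example_trace = [
    ("store", "A"),   # 0
    ("load", "B"),    # 1
    ("store", "C"),   # 2
    ("load", "B"),    # 3  task begins; load after load is not a conflict
    ("load", "A"),    # 4  conflicts with the store at index 0
    ("store", "B"),   # 5  conflicts with the load at index 1
]
```

For `example_trace` with the task spanning indices 3 to 5, the closest conflict source is the load of region B at index 1, paired with the store at index 5.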

Proposed Launch Points The memory dependences for a task are marked, 
via profiling, as proposed launch points (PLPs). A PLP represents the memory 
access conflict with the latest (closest) conflict source in the dynamic code pre- 
ceding one execution of that task. Exactly one PLP is found for each dynamic 
task execution. In our approach, launch points for a task occur only within the 
task’s calling region, limiting the amount and scope of executable changes that 
would be needed to exploit STP. Thus, PLPs must also lie within a task’s calling 
routine. 

Figure 2 contains an example that demonstrates the latest conflict sources 
and their associated PLPs. Task Z calls tasks X and Y. Y is the currently exe- 
cuting task in the example and Z is its calling routine. When the conflict source 
occurs before the beginning of the calling routine, as in Figure 2(a), the PLP is 
directly before the first instruction of the task Z. When the conflict source occurs 
within a sibling task or its child tasks, as in Figure 2(b), the PLP immediately 
follows the call to the sibling task. In Figure 2(c), the conflict source is a calling 
routine instruction and the PLP immediately follows the conflict source. 

Two measures of memory-independence are associated with each PLP. They 
are the PLP overlap and the task overlap, as seen in Figure 2. The PLP overlap 
represents the number of dynamic instructions found between the PLP and the 
beginning of the task. The task overlap represents the number of dynamic in- 
structions between the beginning of the task and the conflict destination. With 
PLPs determined by memory dependences that are dynamically closest to the 

Fig. 3. Task Launch Points: Dotted areas have not been fully visited by the back-trace
for task Y. (a) CFG block contains a PLP. (b) CFG block is calling routine head. (c)
Loop contains a PLP. (d) Incompletely visited CFG block. (e) Incompletely visited loop.

task call site, and by recording the smallest task overlap, we only consider con- 
servative, safe locations with respect to the profiling dataset. 

Task Launch Points Both memory dependences and control dependences in- 
fluence the placement of task launch points. Our initial approach to exposing 
STP determines task launch points that provide two guarantees. First, static 
control dependence is preserved: all paths from a task launch point lead to the 
original task call site. Second, profiled memory dependence is preserved: should 
the threaded program be executed on the profiling dataset, all instructions be- 
tween the task launch point and the originally scheduled task call site will be 
free of memory conflicts. Variations which relax these guarantees are described 
in Section 5.2. 

For each task, we recursively traverse the CFG in reverse control flow order 
starting from the original call site, navigating conditionals and loops, to identify 
task launch points. We use two auxiliary structures: a stalled block list to hold 
incompletely visited blocks, and a stalled loop list to hold incompletely visited 
loops. There are five conditions under which we record a task launch point. These 
conditions are described below. The first three will halt recursive back-tracing 
along the current path. As illustrated in Figure 3, we record task launch points: 

a. when the current CFG block contains a PLP for that task. The task launch 
point is the last PLP in the block. 

b. when the current CFG block is the head of the task’s calling routine and 
contains no PLPs. The task launch point is the first instruction in the block. 

c. when the current loop contains a PLP for that task. Back-tracing will only 
get to this point when it visits a loop, and all loop exit edges have been 
visited. As this loop is really a sibling of the current task, task launch points 
are recorded at the end of all loop exit edges. 

d. for blocks that remain on the stalled block list after all recursive back-tracing 
has exited. A task launch point is recorded only at the end of each visited 
successor edge of the stalled block. 

e. for loops that remain on the stalled loop list after all recursive back-tracing 
has exited. A task launch point is recorded only at the end of each visited 
loop exit edge. 
Each task launch point indicates a position in the executable in which to
place a task future. At each task’s original call site, a check on the status of the 
future will indicate whether to execute the task serially or wait on the result of 
the future. 
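The back-trace can be illustrated with a simplified sketch that handles only acyclic CFGs, and therefore only conditions (a), (b) and (d); loops, and with them conditions (c) and (e), are deliberately left out. The data layout (successor and predecessor dictionaries, a set of blocks containing PLPs) and all names are assumptions made for this example, not a description of the actual tool.

```python
def find_launch_points(succs, preds, call_block, head, plp_blocks):
    """Reverse traversal from the task's call site over an acyclic CFG.
    A predecessor is back-traced only once all of its successor edges
    have been visited, so every path from it reaches the call site."""
    visited_edges = set()
    stalled = set()        # blocks with some, but not all, successors visited
    launch_points = []

    def back_trace(block):
        if block in plp_blocks:                       # condition (a)
            launch_points.append(("after_last_plp", block))
            return
        if block == head:                             # condition (b)
            launch_points.append(("block_start", block))
            return
        for pred in preds.get(block, []):
            visited_edges.add((pred, block))
            if all((pred, s) in visited_edges for s in succs[pred]):
                stalled.discard(pred)
                back_trace(pred)
            else:
                stalled.add(pred)

    back_trace(call_block)
    for block in sorted(stalled):                     # condition (d)
        for s in succs[block]:
            if (block, s) in visited_edges:
                launch_points.append(("edge_end", (block, s)))
    return launch_points


# Hypothetical diamond: H branches to A and B, both reach the call in C.
diamond_succs = {"H": ["A", "B"], "A": ["C"], "B": ["C"], "C": []}
diamond_preds = {"A": ["H"], "B": ["H"], "C": ["A", "B"]}

# Hypothetical side exit: H branches to A (reaches the call) or X (does not).
exit_succs = {"H": ["A", "X"], "A": ["C"], "X": [], "C": []}
exit_preds = {"A": ["H"], "X": ["H"], "C": ["A"]}
```

In the diamond, every path from H reaches the call, so the launch point moves to the start of H. With the side exit, H stays on the stalled list and the launch point is placed at the end of the edge from H to A.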

Parallelization Estimation We estimate the number of memory-independent 
instructions that would have been exposed had the tasks been executed at their 
launch points during the profile run. Our approach ensures that each instruction 
is counted as memory-independent at most once. When the potential for instruc- 
tion overlap exceeds the task selection threshold the task is marked for STP. We 
use the total number of claimed memory-independent instructions as an estimate 
of the limit of STP available on our hypothetical speculative machine. 

4 Methodology 

To investigate STP, we used the ATOM profiling tools [17] and identified natural 
loops as defined by Muchnick [13]. We profiled the SPECint95 suite of benchmark 
programs. Each benchmark was profiled for 650 million instructions. We used the 
reference datasets on all benchmarks except compress. For compress, we used a
smaller dataset, in order to profile a more interesting portion of the application. 

We measure STP by the number of memory-independent task instructions 
that would overlap preceding non-task instructions should a selected task be 
launched (as a percentage of all dynamic instructions). 

The task selection threshold comprises two values, both of which must be 
exceeded. For all runs, the task selection threshold was set at 25 memory-
independent instructions per task execution and a total of 0.2% of instructions
executed. We impose this threshold to compensate for the expected overhead of 
managing speculative threads and to enable allocation of limited resources to 
tasks exposing more STP. 
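The two-part threshold can be written out as a small predicate. The sketch assumes that the per-execution value is an average over the task's dynamic executions, which is one possible reading of the description; the function and parameter names are hypothetical.

```python
def passes_threshold(indep_per_execution, task_executions, total_instructions,
                     min_per_execution=25, min_fraction=0.002):
    """Both parts of the task selection threshold must be exceeded:
    memory-independent instructions per task execution, and the task's
    total share of all instructions executed (0.2% by default)."""
    total_indep = indep_per_execution * task_executions
    return (indep_per_execution > min_per_execution
            and total_indep > min_fraction * total_instructions)


# Each benchmark was profiled for 650 million instructions,
# so 0.2% corresponds to 1.3 million instructions.
PROFILED_INSTRUCTIONS = 650_000_000
```

For example, a task exposing 40 instructions per execution over 50,000 executions (2 million in total) is selected, while one exposing 40 instructions over only 10,000 executions (400,000 in total) is not.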

Our results show a limit to STP exposed by the launched execution of 
memory-independent tasks. No changes, such as code motion, were made or as- 
sumed to have been made to the original benchmark codes that would heighten 
the amount of memory-independent instructions. Overhead due to thread cre- 
ation, along with wakeup and commit, will be implementation dependent, and 
thus is not accounted for. Register dependences between the preceding code 
and the launched task were ignored. Therefore, we show an upper bound to the 
amount of STP in irregular applications. 

5 Results 

We investigated the amount of Speculative Task Parallelism under a variety of 
assumptions about task types, memory conflict granularity, control and memory 
dependences. Our starting configuration includes profiling at the page-level (that 
is, conflicts are memory accesses to the same page) with no explicit speculation 
and is thus our most conservative measurement. 
Fig. 4. Task Types: Individual data points identify memory independent instructions
as a percentage of all instructions profiled and represent our starting configuration. 
5.1 Task Type

We first look to see which task types exhibit the most STP. We then explore var- 
ious explicit speculation opportunities to find additional sources of STP. Finally, 
we investigate any additional parallelism that might be exposed by profiling at 
a finer memory granularity. 

We investigate leaf procedures, non-leaf procedures, and entire natural loops. 
We profiled these three task types to see if any one task type exhibited more STP 
than the others. Figure 4 shows that, on average, a total of 7.3% of the profiled 
instructions were identified as memory independent, with task type contributions 
differing by less than 1% of the profiled instructions. This strongly suggests 
that all task types should be considered for exploiting STP. The succeeding 
experiments include all three task types in their profiles. 



5.2 Explicit Speculation 

The starting configuration places launch points conservatively, with no explicit 
control or memory dependence speculation. Because launched tasks will be im- 
plicitly speculative when executed with different datasets, our hypothetical ma- 
chine must already support speculation and recovery. We explore the level of 
STP exhibited by explicit speculation, first, by speculating on control depen- 
dence, where the launched task may not actually be needed. Next, we speculate 
on memory dependence, where the launched task may not always be memory- 
independent of the preceding code. Finally, we speculate on both memory and 
control dependences by exploiting the memory-independence of the instructions 
within the task overlap. 

Our starting configuration determines task launch points that preserve static 
control dependences, such that all paths from the task’s launch points lead 
to the original task call site. Thus, a task that is statically control dependent 
upon a condition whose outcome is constant, or almost constant, throughout 
the profile, will not be selected, even though launching this task would lead to 
almost no mis-speculations. We considered two additional control dependence 
schemes that would be able to exploit the memory-independence of this task. 

Profiled control dependences exclude any static control dependences that 
are based upon branch paths that are never traversed. When task launch points 
Fig. 5. Control Dependence Speculation: Using the profile information that the condi-
tions are true 90%, 90%, and 100%, respectively, profiled control dependence and
speculative control dependence allow task A to be launched outside of the inner if state- 
ment. The corresponding CFG displays the launch points as placed by each type of 
control dependence. The edge frequencies reflect that the code was executed 100 times. 



preserve profiled control dependences, all traversed paths from the launch points 
lead to the original call site. 

When task launch points preserve speculative control dependences, all 
frequently traversed paths from the launch points lead to the original call site. 
The amount of speculation is controlled by setting a minimum frequency per- 
centage, c. For example, when c is set to 90, then at least 90% of the traversed 
paths from the launch points must lead to the original call site. 
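One way to evaluate a candidate launch point against c is to estimate, from profiled edge frequencies, the fraction of traversals leaving a block that reach the call site. The sketch below is an illustration under two assumptions: the CFG is acyclic, and branch outcomes are treated as independent, so the fraction is a product of local branch frequencies. The example CFG and its block names are hypothetical.

```python
def reach_fraction(block, call_block, edge_freq, memo=None):
    """Fraction of profiled traversals leaving `block` that reach call_block.
    edge_freq maps block -> {successor: traversal count}."""
    if memo is None:
        memo = {}
    if block == call_block:
        return 1.0
    if block in memo:
        return memo[block]
    out = edge_freq.get(block, {})
    total = sum(out.values())
    frac = 0.0
    if total > 0:
        frac = sum(n / total * reach_fraction(s, call_block, edge_freq, memo)
                   for s, n in out.items())
    memo[block] = frac
    return frac


def satisfies_c(block, call_block, edge_freq, c=90):
    """True if at least c percent of traversed paths lead to the call site."""
    return reach_fraction(block, call_block, edge_freq) >= c / 100


# Hypothetical profile of 100 executions: B1 falls through to B2 90 times,
# B2 always continues to B3 (its other edge is never traversed),
# and B3 always reaches the call.
profile = {
    "B1": {"B2": 90, "exit": 10},
    "B2": {"B3": 90, "exit": 0},
    "B3": {"call": 90},
}
```

With this profile, B2 satisfies profiled control dependence (all traversed paths reach the call), while B1 qualifies only under speculative control dependence with c at or below 90.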

In Figure 5, the call statement of task A is statically control dependent 
on all three if-statements. The corresponding CFG in Figure 5 highlights the 
launch points as determined by the three control dependence options. All paths 
beginning with block H, all traversed paths beginning with block E, and 90% of 
the traversed paths beginning with block C lead to the call of task A. Therefore, 
the static control dependence launch point is before block H, the profiled control 
dependence launch point is before block E, and with c set to 90, the speculative 
control dependence launch point is before block C. 

The price of using speculative control dependences will be the waste of re- 
sources used to speculatively initiate a launched task when the executed path 
does not lead to the task call site. These extra launches can be squashed at the 
first mis-speculated conditional. 

The first three bars per benchmark in Figure 6 show the effect of control
dependence speculation. The bars display static control dependence, profiled 
control dependence, and speculative control dependence at c = 90, respectively. 
On average, profiled control dependence exposed an additional 1.3% of dynamic 
instructions as memory-independent, while speculative control dependence only 
exposed an additional 0.6% over profiled. 

The choice of using profiled or speculative control dependence will be in- 
fluenced by the underlying architecture, and the degree to which speculative 
Fig. 6. Memory-independent instructions reported as a percentage of all instructions
profiled. Page = Page-level profiling, Word = Word-level profiling, ProfControl = Pro-
filed control dependence, SpecControl = Speculative control dependence, SpecMemory
= Speculative memory dependence, Synch = Early start with synchronization.



threads compete with non-speculative threads for resources. Further results in this
paper use profiled control dependence, due to the low gain from speculative control
dependence. 

Memory dependence provides another opportunity for explicit speculation. 
Our starting configuration determines launch points that preserve profiled 
memory dependences such that all instructions between each launch point 
and its original task call site are memory-independent of the task. This approach 
results in a conservative, but still speculative, placement of launch points. 

We also consider the less conservative approach of determining task launch 
points by speculative memory dependences, which ignores profiled memory 
conflicts that occur infrequently. The amount of speculation is controlled by 
setting a minimum frequency percentage, m. For example, when m is set to 90, 
then at least 90% of the traversed paths from the task launch points to the 
original call site must be memory-independent of the task. Using speculative 
memory dependences is especially attractive when PLPs are far apart, and the 
ones nearest the task call site seldom cause a memory conflict. 
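The effect of m can be illustrated with a deliberately simplified model in which each dynamic execution of a task contributes one PLP overlap, and the launch distance is the largest one that stays conflict-free in at least m percent of executions. Treating each execution as one traversed path is an assumption of this sketch, not the path-based analysis itself, and the function name is hypothetical.

```python
def speculative_plp_overlap(plp_overlaps, m=90):
    """Largest launch distance (in dynamic instructions before the task)
    that is free of profiled conflicts in at least m percent of the
    profiled executions. With m = 100 this is the conservative minimum."""
    if not plp_overlaps:
        raise ValueError("need at least one profiled execution")
    ordered = sorted(plp_overlaps, reverse=True)
    # Smallest number of executions that makes up at least m percent
    # (integer ceiling of len * m / 100).
    needed = max(1, -(-len(ordered) * m // 100))
    return ordered[needed - 1]


# Hypothetical profile: nine executions with a distant PLP and one
# execution whose PLP sits only 10 instructions before the task.
overlaps = [500] * 9 + [10]
```

Here the single nearby conflict limits the conservative launch distance to 10 instructions, whereas ignoring it at m = 90 allows a distance of 500.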

We examine the effect of task launch points that preserve speculative mem- 
ory dependence at m = 90 (the fourth bar in Figure 6). Speculative memory 
dependence provides small increases in parallelism. Despite the small gains, we 
include task launch points determined by speculative memory dependence for 
the remaining results. 

By placing launch points (futures) at control dependences or memory de- 
pendences (PLPs), we have used the limited synchronization inherent within 
futures to synchronize these dependences with the beginning of the speculative 
task. This limits the amount of STP that we have been able to expose to the 
number of dynamic instructions between the launch point and the original task 
call site, which we call the LP overlap. The instructions represented by the task 
overlap, between the beginning of the speculative task and the earliest profiled 
conflict destination, are profiled memory-independent of all of the preceding 
code. By using explicit additional synchronization around the earliest profiled 
Fig. 7. Synchronization Points: Gray areas represent memory-independent instruc-
tions. (a) A task with a large amount of task overlap on a serial thread. (b) When
the dependence is used as a launch point, only LP overlap contributes to memory-
independence. (c) By synchronizing the dependence with the earliest conflict destina-
tion, both LP overlap and task overlap contribute to memory-independence.



conflict destination, early start with synchronization enables the task overlap to 
contribute to the number of exposed memory-independent instructions. 



Early Start with Synchronization Currently, a task with a large task over- 
lap and a small LP overlap would not be selected as memory-independent, even 
though a large portion of the task is memory-independent with its preceding 
code. By synchronizing the control or memory dependence with the earliest con- 
flict destination, the task may be launched earlier than the dependence. Where 
possible, we placed the task launch point above the dependence a distance equal 
to the task overlap. Any control dependences between the new task launch point 
and the synchronization point would be handled as speculative control depen- 
dences. 

Figure 7 illustrates synchronization points. When the dependence determines 
a launch point, in Figure 7(b), all memory-independent instructions come from 
the LP overlap. Figure 7(c) shows that by synchronizing the dependence with the 
earliest conflict destination, both the LP overlap and the task overlap contribute 
to the number of memory-independent instructions. 
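The arithmetic behind this comparison is small enough to write down. The sketch assumes the launch point can always be moved up by the full task overlap, which the text notes is only done where possible; the function name and the example values are hypothetical.

```python
def exposed_instructions(lp_overlap, task_overlap, early_start=False):
    """Memory-independent instructions exposed per task execution.
    Without early start only the LP overlap counts; with early start
    and synchronization the task overlap is added to it."""
    if early_start:
        return lp_overlap + task_overlap
    return lp_overlap


# Per-execution part of the task selection threshold used in this study.
PER_EXECUTION_THRESHOLD = 25

# Hypothetical task: small LP overlap, large task overlap.
without_sync = exposed_instructions(5, 60)
with_sync = exposed_instructions(5, 60, early_start=True)
```

A task like this one fails the per-execution threshold when launched at the dependence, but exceeds it once early start with synchronization is applied.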

Early start shows the greatest increase in parallelism so far, exposing on
average an additional 6.6% of dynamic instructions as memory-independent (the
fifth bar per benchmark of Figure 6). The big increase in parallelism came from
tasks that had not previously exhibited a significant level of STP, but now are 
able to exceed our thresholds. 

The extra parallelism exposed through early start will come at the cost of 
additional dynamic instructions and the cost of explicit synchronization. We did 
not impose any penalties to simulate those costs as they will be architecture- 
dependent. 
Fig. 8. Conservative vs. Aggressive Speculation: Conservative is page-level profiling on
all task types with static control dependence and profiled memory dependence. Aggres- 
sive is word-level profiling on all task types with profiled control dependence, speculative 
memory dependence, and early start. 



5.3 Memory Granularity 

We define a memory access conflict to occur with two accesses to the same 
memory region. The memory granularity (the size of these regions) affects the
amount of parallelism that is exposed. Reasonable granularities are full bytes,
words, cache-lines, or pages. A larger memory granularity may result in a more
conservative placement of launch points. The actual granularity
used will depend on the granularity at which the processor can detect memory 
ordering violations. Managing profiled-parallel tasks whose launch points were
determined with a memory granularity of a page would allow the use of existing 
page protection mechanisms to detect and recover from dependence violations. 
Thus, our starting configuration used page-level profiling. We also investigate 
word-level profiling. 
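The influence of granularity on conflict detection can be shown with a short sketch that maps addresses to regions. The region sizes in the table are assumptions chosen for illustration (in particular the word, cache-line and page sizes); they are not values reported in this study.

```python
# Assumed region sizes in bytes; illustrative only.
GRANULARITY_BYTES = {"byte": 1, "word": 4, "cache_line": 32, "page": 8192}


def region(address, granularity):
    """Index of the memory region that contains `address`."""
    return address // GRANULARITY_BYTES[granularity]


def conflicts(addr_a, addr_b, granularity):
    """Two accesses conflict when they fall into the same region."""
    return region(addr_a, granularity) == region(addr_b, granularity)


# Two hypothetical accesses on the same page but in different words.
first, second = 0x2000, 0x2FF8
```

At page granularity the two accesses are reported as a conflict and would constrain the launch point; at word granularity they are not, which is where the additional parallelism of finer-grained profiling comes from.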

In Figure 6, the last bar shows the results of word-level profiling on top of 
profiled control dependence, speculative memory dependence and early start. 
The average gain in memory-independence across all benchmarks was about 6% 
of dynamic instructions. 

5.4 Experiment Summary 

Figure 8 re-displays the extremes of our STP results from conservative to ag- 
gressive speculation broken down by task types. The conservative configuration 
includes page-level profiling on all task types with static control dependence and 
profiled memory dependence. The aggressive configuration comprises word-level 
profiling on all task types with profiled control dependence, speculative mem- 
ory dependence, and early start. M88ksim showed the largest increase in the 
percentage of memory-independent instructions at over 28%, with vortex very 
close at over 25%, and the average across benchmarks at about 14%. Each of 
these increases in parallelism was largely seen in the non-leaf procedures. Ijpeg
was the only benchmark to see a sizable increase contributed by leaf procedures. 
Loops accounted for increases in gcc, go, li and perl. 
Table 1. Task Statistics (Conservative vs. Aggressive)



Table 1 displays statistics from the conservative and aggressive speculation of 
those tasks which exceed our thresholds. The average overlap is that part of the 
average task length that can be overlapped with other execution. The number 
of tasks selected for STP is greatly affected by aggressive speculation. 

In this section, early start with synchronization provided the highest single
increase among all alternatives. Speculative systems with fast synchronization 
should be able to exploit STP the most effectively. Our results also indicate that 
a low-overhead word-level scheme to exploit STP would be profitable. 



6 Implementation Issues 

For our limit study of Speculative Task Parallelism, we have assumed a hypo- 
thetical speculative machine that supports speculative futures with mechanisms 
for resolving incorrect speculation. When implementing this machine, a number 
of issues need to be addressed. 

Speculative Thread Management Any system that exploits STP would 
need to include instructions for initialization, synchronization, communication 
and termination of threads. As launched tasks may be speculative, any imple- 
mentation would need to handle mis-speculations. 

Managing speculative tasks would include detecting load/store conflicts be- 
tween the preceding code and the launched task, buffering stores in the launched 
task, and checking for memory-independence before committing the buffered 
stores to memory. One conflict detection model includes tracking the load and 
store addresses in both the preceding code and the launched task. The amount 
of memory-independence accommodated by this model will be determined by 
the size and access of load-store address storage, and the conflict granularity. 

Another conflict detection model uses a system’s page-fault mechanism. When 
static analysis can determine the page access pattern of the preceding code, the 
launched task is given restricted access to those pages, while the preceding code 
is given access to only those pages. Any page access violation would cause the 
speculative task to fail. 
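The address-tracking model can be sketched in software as follows. This is only an illustration of the idea of buffering stores and checking for conflicts before commit; the `SpeculativeTask` class, the dictionary used as memory, and the way the preceding code's address sets are supplied are all assumptions of this example rather than a proposed hardware design.

```python
class SpeculativeTask:
    """Hypothetical launched task that tracks its loads and buffers its
    stores until memory-independence from the preceding code is verified."""

    def __init__(self, memory):
        self.memory = memory        # shared memory, modelled as a dict
        self.loads = set()          # addresses read from shared memory
        self.store_buffer = {}      # address -> speculatively stored value

    def load(self, addr):
        if addr in self.store_buffer:      # forward from our own store
            return self.store_buffer[addr]
        self.loads.add(addr)
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        self.store_buffer[addr] = value    # not yet visible to other code

    def try_commit(self, preceding_loads, preceding_stores):
        """Commit buffered stores only if no [load, store], [store, load]
        or [store, store] conflict exists with the preceding code."""
        task_stores = set(self.store_buffer)
        conflict = (preceding_stores & (self.loads | task_stores)
                    or task_stores & preceding_loads)
        if conflict:
            self.store_buffer.clear()      # squash the speculative work
            self.loads.clear()
            return False
        self.memory.update(self.store_buffer)
        return True
```

On a failed commit the task's effects are discarded, and the original thread would then execute the task serially at its call site.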
Inter-thread Communication Any implementation that exploits STP will
benefit from a system with fast communication between threads. At the mini- 
mum, inter-thread communication is needed at the end of a launched task and 
when the task results are used. Fast communication would be needed to enable 
early start with synchronization. The ability to quickly communicate a mis- 
speculation would reduce the number of instructions that are issued but never 
committed. This is especially important for systems where threads compete for 
the same resources. 

Adaptive STP. We select tasks that exhibit STP based upon memory access
profiling and compiler analysis. The memory access pattern from one dataset 
may or may not be a good predictor for another dataset. Two feedback opportunities
arise that allow the execution of another dataset to adapt to differences
from the profiled dataset. The first would monitor the success rate of particular
launch points, and suspend further launches from a point that fails too frequently.
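As a rough sketch of this first feedback mechanism, the following monitor counts successes per launch point and stops launching once the observed success rate drops below a threshold. The class, the threshold, and the minimum sample count are hypothetical; the paper does not specify a policy.

```python
class LaunchMonitor:
    """Hypothetical per-launch-point success monitor."""

    def __init__(self, min_success_rate=0.5, min_samples=8):
        # Both thresholds are illustrative assumptions.
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples
        self.stats = {}  # launch point -> [successes, attempts]

    def report(self, point, succeeded):
        # Record the outcome of one speculative launch.
        entry = self.stats.setdefault(point, [0, 0])
        entry[0] += int(succeeded)
        entry[1] += 1

    def should_launch(self, point):
        successes, attempts = self.stats.get(point, (0, 0))
        if attempts < self.min_samples:
            return True  # not enough evidence yet, keep speculating
        return successes / attempts >= self.min_success_rate
```

A real implementation might also re-enable a suspended launch point after some interval, since the sketch above never gathers new evidence once a point is suspended.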

The second feedback opportunity is found in continuous profiling. Rather 
than have a single dataset dictate the launched tasks for all subsequent runs, 
let datasets from all previous runs dictate the launched tasks for the current 
run. It is possible that the aggregate information from the preceding runs would 
have a better predictive relationship with future runs. Additionally, the profiled 
information from the current run could be used to supersede the profiled infor- 
mation from previous runs, with the idea that the current run may be its own 
best predictor. Although profiling is expensive and must be optimized, the exact 
cost is beyond the scope of this paper. 

7 Summary 

Traditional parallel compilers do not effectively parallelize irregular applications
because these applications contain little loop-level parallelism due to ambiguous
memory references. A different source of parallelism, namely Speculative Task
Parallelism (STP), arises when a task (either a leaf procedure, a non-leaf procedure
or an entire loop) is control- and memory-independent of its preceding code, and thus could
be executed in parallel. To exploit STP, we assume a speculative machine that 
supports speculative futures (a parallel programming construct that executes a 
task early on a different thread or processor) with mechanisms for resolving in- 
correct speculation when the task is not, after all, independent. This allows us to 
speculatively parallelize code when there is a high probability of independence, 
but no guarantee. 

Through profiling and compiler analysis, we find memory-independent tasks 
that have no memory conflicts with their preceding code, and thus could be 
speculatively executed in parallel. We estimate the amount of STP in an irregular 
application by measuring the number of memory-independent instructions these 
tasks expose. We vary the level of control dependence and memory dependence
to investigate their effect on the amount of memory-independence we find.
We profile at different memory granularities and introduce synchronization to
expose higher levels of memory-independence.

We find that no one task type exposes significantly more memory-independent
instructions, which strongly suggests that all three task types should be profiled 
for STP. We also find that starting a task early with synchronization around 
dependences exposes the highest additional amount of memory-independent in- 
structions, an average across the SPECint95 benchmarks of 6.6% of profiled 
instructions. Profiling memory conflicts at the word-level shows a similar gain 
in comparison to page-level profiling. Speculating beyond profiled memory and 
static control dependences shows the lowest gain which is modest at best. Over- 
all, we find that 7 to 22% of instructions are within memory-independent tasks. 
The lower amount reflects tasks launched in parallel from the least speculative 
locations.



Abstract. Simultaneous multithreading (SMT) processors achieve high
performance by executing independent instructions from different programs
simultaneously [1]. However, the SMT model does not help single-thread
applications and performs at its full potential only when executing 
multithreaded applications or multiple programs. Moreover, to minimize the 
total execution time of one selected high priority thread, that thread has to run 
alone. Recently, several speculative multithreading architectures have been 
proposed that exploit far away instruction level parallelism in single-thread 
applications. In particular, the dynamic multithreading or DMT model [2] uses 
hardware mechanisms to fork speculative threads at procedure and loop 
boundaries along the execution path of a single program, and executes these 
threads on a multithreaded processor. In this paper, we explore the performance 
scope of an SMT architecture in which spare thread contexts are used to support 
the DMT execution of procedure and loop threads. We show two significant 
advantages of this approach: (1) it increases processor utilization and total 
execution throughput when few programs are running, and (2) it eliminates or 
reduces the performance degradation of one selected high priority program 
when running simultaneously with other programs, without reducing total SMT 
throughput. 



1 Introduction 

In the very near future, tens of millions of transistors will be available to build single-
chip microprocessors. The single-thread superscalar model used in most current
commercial processors may not be able to take full advantage of this opportunity. For
example, a straightforward approach that increases the size of existing
mechanisms, such as the memory hierarchy or branch prediction tables, will not bring
spectacular performance gains. Also, inherently sequential tasks such as instruction
fetch and register renaming, and the high frequency of branches in many typical
programs will make a direct increase of pipeline width more difficult.

Simultaneous multithreading is a promising technique to obtain more performance 
from a single-chip processor [1]. SMT integrates several hardware contexts to allow 
the concurrent execution of multiple programs. The threads share most of the 
processor resources and in each cycle, instructions from multiple threads can be 



M. Valero et al. (Eds.): ISHPC 2000, LNCS 1940, pp. 59-72, 2000. 
© Springer-Verlag Berlin Heidelberg 2000




60 



Haitham Akkary and Sebastien Hily 



simultaneously issued to the functional units. The availability of independent
instructions from different threads enables an SMT processor to address several
bottlenecks of conventional superscalar processors: low ILP of non-scientific
applications, cache miss latency, branch mispredictions, and waste of issue slots due
to instruction misalignment and taken branches. On mixed applications from the
Spec92 benchmark suite, with 8 threads, a 2.5x throughput gain over a complexity-
equivalent superscalar processor could be expected [3].

SMT, however, relies heavily on the availability of multiple threads in each cycle 
and does not improve performance when only a single-thread application is present 
for execution. This limitation could be overcome with parallelizing compilers that 
automatically create multiple threads from the same program. Unfortunately, many 
non-numeric programs have proven to be difficult to parallelize due to their complex 
control flow and memory access patterns. Recently, there have been several 
promising proposals for speculative multithreaded architectures 
[4,5,6,7,8,9,10,11,12,13]. Some of these architectures [4,5,6,7,10,11] combine thread- 
level data speculation hardware with special compilers to extract parallelism from 
general-purpose programs. Others [2,8,9,13] are completely dynamic and target 
existing single-thread binaries that run on current superscalar processors. These 
dynamic multithreading techniques may be especially interesting if the basic 
architecture provides the flexibility of mixing speculative and simultaneous multiple 
threads, providing performance gain for single-thread as well as multithreaded 
applications. 

In particular, the Dynamic Multithreading (DMT) model [2] improves single 
program execution with out-of-order instruction fetch and issue, and speculative 
execution far ahead in a program. As on an SMT microprocessor, several hardware 
contexts support the simultaneous execution of several threads through a shared 
superscalar pipeline. Unlike traditional SMT, the threads come from a single program 
and are created dynamically by hardware. On a DMT processor, the program is 
dynamically forked at the end of a loop or at a procedure call, and a speculative thread 
is initiated to follow the fall-through path. With this approach, [2] shows that the
performance of Spec95 integer applications could be improved by up to
35%, on average.

In this paper, we explore the performance impact of adding speculative 
multithreading to an SMT model. We evaluate whether there is any headroom left in 
the SMT architecture, with its heavy resource sharing, to support the significant 
amount of speculative execution that can take place on a speculative multithreaded 
architecture [14]. We show that the benefit of speculative multithreading exceeds the 
disadvantage arising from misprediction and execution resource sharing, resulting in
a net increase in total execution throughput of the SMT processor. We also show that
speculative multithreading speeds up the execution of one high priority program
running with one other program simultaneously, over the non-speculative SMT
processor running the same program alone. Our results are presented for a variety of
instruction fetch configurations and highlight some of the tradeoffs that would face
the designer of such an architecture. In our study, we have used the DMT architecture,
which we believe represents an extreme case in its speculation penalty, since it relies 
completely on hardware and requires no compiler support for thread selection and 
scheduling.



2 Architecture Overview

Our baseline processor model is derived from the DMT architecture presented in
[2] and illustrated in Figure 1. It supports the execution of up to 8 threads. In our case,
a thread can be a conventional SMT program or can be created dynamically by
hardware. Each thread has its own program counter, trace buffer in which speculative 
instructions and their results are stored, load and store queues, and return address 
stack. The threads share the memory hierarchy, the physical register file, the 
functional units, and the branch prediction tables. In Figure 1, the dark shaded boxes 
correspond to duplicated hardware. Depending on the configuration, the hardware 
corresponding to the light shaded boxes can be either duplicated or shared. 




Fig. 1. Processor block diagram 

In each cycle, active threads are competing for fetch and issue resources. A priority 
mechanism is implemented using a round-robin policy. Fetched instructions are
decoded, renamed, and put into an instruction buffer where they wait for their 
operands and for the availability of functional units. Decoded instructions are also 
written into the trace buffer attached to their thread. When instructions execute, 
results are written back into the physical register file, as well as into the trace buffer. 

A trace buffer acts as a large speculative instruction window and its primary goal is 
to improve the execution throughput by allowing far ahead speculation. New threads 
are created at loop backward branches and procedure calls. They are executed 
speculatively starting at the instructions following the procedures or the loops. In 
other words, two different threads pursue both the branch or call path and the fall- 
through path simultaneously. Threads are not necessarily created at every opportunity. 
A thread prediction table of 2-bit saturating counters is used to predict which 
opportunities are taken. 
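A table of 2-bit saturating counters, as mentioned above, can be sketched as follows. The table size, the indexing by program counter, the initial counter value, and the meaning of the `useful` training signal are assumptions; the paper only states that 2-bit saturating counters predict which spawn opportunities are taken.

```python
class ThreadPredictor:
    """Hypothetical thread prediction table of 2-bit saturating counters."""

    def __init__(self, entries=1024):
        self.entries = entries            # assumed table size
        self.counters = [1] * entries     # assumed start: weakly "do not spawn"

    def _index(self, pc):
        # Assumed indexing: word-aligned PC, low-order bits select the entry.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        # Counter values 2 and 3 mean "spawn a thread at this point".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, useful):
        # Saturate at 0 and 3 so one outlier does not flip a strong prediction.
        i = self._index(pc)
        if useful:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The saturation gives hysteresis: a spawn point that has been useful several times tolerates a single bad outcome before the prediction changes.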

Threads may depend on register or memory values produced by other threads from 
the same program. Waiting for other threads to compute all the input data would 
seriously hamper the run-ahead ability of the architecture. To solve this problem, a
simple data speculation mechanism is implemented: a new thread executes using the
register context from the spawning thread as input. Although some of these register
values may be modified later, such as the procedure return value register, this mechanism
successfully predicts most of the register input values due to the scope of the variables
in the caller [2]. Memory loads are speculatively issued as if there were no
conflicting stores to the same addresses from prior threads. Memory disambiguation 
circuitry in the load queues, and register value matching circuitry in the trace buffer, 
activated before a thread is committed, are used to identify data mispredictions. A 
recovery sequence is initiated when a misprediction is detected. 

During recovery, instructions affected by data mispredictions are sequentially 
fetched from the trace buffer (instead of the instruction cache), renamed, and issued 
again for repair execution. Instructions that are not affected by data mispredictions do 
not need repair and they are filtered out using a register scoreboard that resides in the 
trace buffer block. This increases the efficiency of repair execution by eliminating 
unnecessary repair. The trace buffer sometimes supplies operands for instruction
repair. This is the case when an instruction has one of two operands computed
incorrectly due to data misprediction. The correct operand does not have to be
recomputed, so it is attached as a constant operand to the repair instruction.
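The filtering step can be pictured with a small sketch that walks a thread's trace in program order and keeps a set of registers known to hold wrong values. This is a simplified illustration, not the repair algorithm of [2]: the trace format and function name are hypothetical, and memory dependences are ignored.

```python
def select_for_repair(trace, mispredicted_regs):
    """Return indices of trace entries that need repair execution.

    trace: list of (dest_register_or_None, tuple_of_source_registers),
           in program order (assumed format).
    mispredicted_regs: registers whose speculated input values were wrong.
    """
    poisoned = set(mispredicted_regs)  # plays the role of the scoreboard
    repair = []
    for i, (dest, srcs) in enumerate(trace):
        if poisoned.intersection(srcs):
            # At least one operand is wrong: re-execute and mark the result.
            repair.append(i)
            if dest is not None:
                poisoned.add(dest)
        elif dest is not None:
            # All operands are correct, so the result is correct and
            # overwrites any wrong value previously held in dest.
            poisoned.discard(dest)
    return repair


example_trace = [
    ("r3", ("r1", "r2")),  # reads mispredicted r1
    ("r4", ("r3", "r5")),  # depends on the wrong r3
    ("r1", ("r6",)),       # overwrites r1 with a correct value
    ("r7", ("r1",)),       # reads the corrected r1
]
to_repair = select_for_repair(example_trace, {"r1"})
```

In this example only the first two instructions are selected; the last two are filtered out because their inputs are correct by the time they appear in the trace.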

A useful characteristic of the trace buffers is that they sit outside the critical 
execution path. This allows far-ahead execution of threads and the storage of many
more speculative instructions and their results without the inconvenience of the
conventional superscalar processor model, in which all speculative state is stored in a
centralized physical register file. Organizing the complete instruction window around
the trace buffers has two major implications: (1) physical registers are freed
immediately when all instructions that need them execute; therefore, operands are
renamed again and new physical registers are allocated at misprediction repair time,
and (2) results are committed in order from the trace buffer into a special register file, 
after thread inputs from prior threads are checked. Full details of the trace buffers, 
data speculation, and repair algorithm are described in [2]. 

The pipeline used in the baseline model is a generic 7-stage out-of-order pipeline 
that includes fetch, decode, rename, issue, register file read, execute, and retire stages. 



3 Simulation Methodology 

In our simulation model, the instructions can be issued to 10 functional units: 4
ALUs, 1 multiply/divide unit, 4 floating point add units and 1 floating point 
multiply/divide unit. Two of the four ALUs perform Load/Store address computation. 
All the functional units, except the divide unit, are pipelined. The execution unit 
latencies are listed in Table 1. All threads share 128 physical registers for renaming 
purposes. The trace buffers have 512 entries per thread. 

A description of the configuration of the memory hierarchy chosen for our 
experiments is also given in Table 1. All caches are fully pipelined. The data cache 
and the L2 cache are lock-up free. The load latency to the first level data cache, not 
including the address computation, is 2 cycles. The 2 cycles account for address
translation and data cache access, which happen in parallel with the address match
and possibly store forwarding from the store queues. Memory latency is 30 cycles. 

For branch prediction, we have used a 512-entry 4-way associative BTB and a 4K- 
entry PHT. The PHT is indexed using a 12-bit XOR of a subset of the instruction 
address and the content of a 12-bit branch history register. Each thread has a private 
branch history register and 32-entry return address stack. 
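The PHT indexing described above resembles the well-known gshare scheme. A minimal sketch follows; which address bits form the "subset of the instruction address" is not stated in the text, so dropping the two low-order bits of a word-aligned PC is an assumption, as are the function names.

```python
PHT_BITS = 12                      # 4K-entry PHT => 12 index bits
PHT_MASK = (1 << PHT_BITS) - 1


def pht_index(pc, history):
    # XOR 12 address bits with the 12-bit branch history register.
    # Assumption: the two low-order bits of the PC are skipped.
    return ((pc >> 2) ^ history) & PHT_MASK


def update_history(history, taken):
    # Shift the latest branch outcome into the per-thread history register.
    return ((history << 1) | int(taken)) & PHT_MASK
```

Because each thread has a private history register, two threads executing the same branch can map to different PHT entries.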

Table 1. Instruction latencies and configuration of the cache hierarchy

Inst. type     Latency (cycles)
Integer ALU    1
Load/Store     2
Multiply       3
Divide         20
FP-ALU         2
FP-multiply    4
FP-divide      12

Cache    Size    Assoc.    Block size    Latency
Inst.    32K     2-way     64 bytes      1 cycle
Data     32K     2-way     64 bytes      2 cycles
L2       1M      4-way     64 bytes      5 cycles

The performance simulator is derived from the SimpleScalar tool set [15] and
supports a modified version of the MIPS-IV ISA. To run our experiments, we used 6
of the Spec95 benchmarks: four integer applications (gcc, go, vortex, and perl) and
two floating-point applications (applu and fpppp). These applications were compiled
using the SimpleScalar version of gcc with level-3 optimization. We have selected
gcc, go, and perl because their execution throughput is low and their parallelization is
difficult. Vortex and fpppp are two of the more memory-intensive applications in the
Spec95 benchmarks.

Our choice of the Spec95 benchmark suite to evaluate a multithreaded
architecture may seem arbitrary, since it is not clear that a practical multithreaded
processor would be running combinations of workloads with similar characteristics to
Spec95 [16]. Unfortunately, there is no standard set of benchmarks established for
SMT-based architectures. Our results should be viewed as a demonstration of the
potential of the evaluated architecture model when running multiple programs
concurrently, some of which may be non-numeric and difficult to parallelize using a
compiler. Since the SMT model emphasizes instruction throughput, we have run all
possible combinations of 1, 2, 3 and 4 of the selected benchmarks, and averaged the
overall throughput. Each simulation was stopped after 50 million simulation cycles.
This gave us approximate run lengths between 50 million and 200 million instructions
per simulation, depending on the number of benchmarks in each combination.
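To give a sense of the size of this experiment space, the sketch below enumerates the workload mixes. It assumes that each mix consists of distinct benchmarks (no program paired with a copy of itself); the text does not say so explicitly, so the resulting counts are illustrative only.

```python
from itertools import combinations

# The six Spec95 benchmarks named in the text.
BENCHMARKS = ["gcc", "go", "vortex", "perl", "applu", "fpppp"]


def workload_mixes(max_programs=4):
    # Assumption: a mix is an unordered set of distinct benchmarks.
    return {k: list(combinations(BENCHMARKS, k))
            for k in range(1, max_programs + 1)}


mixes = workload_mixes()
counts = {k: len(v) for k, v in mixes.items()}
total_mixes = sum(counts.values())
```

Under that assumption there are 6, 15, 20 and 15 mixes for 1, 2, 3 and 4 programs, or 56 simulations per architecture configuration.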



4 Simulation Results 

In this section, we investigate the performance of 3 different architectural models. To 
execute several threads simultaneously, the front-end of the processor has to provide a
lot of bandwidth. Current microprocessors typically allow up to 4 instructions to be
fetched and issued into the scheduling window per cycle. To offer significantly
better bandwidth on a multithreaded processor, the most straightforward way is to increase
the number of blocks that can be issued simultaneously. Using this approach, the
instruction cache would be multi-ported, and the decode and rename units would be 
duplicated. For our baseline model, we have chosen a configuration that doubles the 
fetch and issue bandwidth of typical current superscalar processors for a small 
increase in complexity. The base configuration has an instruction cache with 2 fetch 
ports, and can rename and issue two blocks of 4 instructions from two different 
threads every cycle. As long as there is more than one active thread, the base machine 
can provide the peak issue bandwidth of an 8-wide processor, but with fetch and 
rename complexity per thread of a 4-wide processor. The performance of the base 
model is presented in section 4.1. We also present in section 4.2 results for two more 
aggressive architectures featuring one and two 8-wide fetch ports respectively. These 
two architectures may be more complex to build than the baseline machine because of 
the high frequency of branches in general purpose programs, the high number of read 
ports into the rename table, and the intra-block dependency check and bypass logic at 
the rename table outputs [17]. 



4.1 Baseline 

Figure 2 shows the performance of two multithreaded architectures that use the 2x4 
fetch/issue configuration of the base machine. The SMT key refers to a conventional 
simultaneous multithreading architecture with 8 thread contexts. The DSMT key 
refers to an SMT architecture capable of creating hardware threads, a la DMT, also 
with 8 thread contexts. DSMT performs consistently better than SMT, with an 
average increase in throughput of 5%, 8%, 13%, and 28% for 4, 3, 2 and 1 programs 
respectively. These results highlight the performance potential of mixing speculative 
multithreading with the simultaneous execution of independent software threads. 




[Figure 2 legend: SMT, DSMT]

Fig. 2. Average IPC for the SMT and DSMT architectures, with two 4-wide fetch ports

It should be noted that when running one program, the SMT architecture behaves
as a simple 4-wide superscalar processor, while the DSMT architecture becomes the 
DMT model presented in [2]. There is a possibility for an improvement on the SMT 
model by implementing a mechanism to predict two addresses per cycle, in order to 
use both fetch ports in the same cycle when there is only one active thread. One such 
mechanism has been proposed in [18]. The DSMT model would also benefit from that 
mechanism, but to a lesser degree, since the frequency of having only one thread in 
active fetch is limited, as shown in Table 3. It should also be noted here that the results
of the comparison between DSMT and SMT might vary if a different thread 
arbitration policy such as Icount [3] is used. 

Table 2 gives the average number of active threads on the DSMT model (both 
speculative and non-speculative) when 1, 2, 3 or 4 programs are running. When only 
one program is present, the DSMT architecture average thread utilization is 3.6, 
significantly less than the machine total of 8 threads. With 4 programs, 85% of the 
thread contexts are utilized. Notice that the increase in throughput over SMT with 3
programs running is significant, even though there are 3 speculative hardware threads
competing for resources with the 3 programs, on average. This indicates that the
benefits of the speculative hardware threads outweigh the misprediction penalties
and the associated loss of fetch and execution bandwidth. Even with 4 programs, there 
is still a 5% gain in the average IPC. Table 3 gives the time distribution of active 
threads for 1, 2, 3 and 4 programs running in DSMT mode. 



Table 2. Average number of DSMT active threads for various numbers of programs

1 Program    2 Programs    3 Programs    4 Programs
3.6          4.8           6             6.8



Table 3. Time distribution of the number of active threads for various numbers of simultaneous
programs running in DSMT mode

Threads       1     2     3     4     5     6     7     8
1 Program     26%   15%   16%   12%   8%    7%    8%    8%
2 Programs    -     15%   18%   17%   14%   11%   9%    16%
3 Programs    -     -     10%   14%   16%   16%   15%   29%
4 Programs    -     -     -     8%    13%   16%   18%   45%
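As a consistency check, the averages in Table 2 can be recomputed from the distribution in Table 3. The sketch below assumes that rows with fewer than eight entries are right-aligned, since N running programs imply at least N active threads; the variable names are illustrative.

```python
# Percent of time with n active threads (Table 3), keyed by number of programs.
TABLE3 = {
    1: {1: 26, 2: 15, 3: 16, 4: 12, 5: 8, 6: 7, 7: 8, 8: 8},
    2: {2: 15, 3: 18, 4: 17, 5: 14, 6: 11, 7: 9, 8: 16},
    3: {3: 10, 4: 14, 5: 16, 6: 16, 7: 15, 8: 29},
    4: {4: 8, 5: 13, 6: 16, 7: 18, 8: 45},
}
# Average number of active threads reported in Table 2.
TABLE2 = {1: 3.6, 2: 4.8, 3: 6.0, 4: 6.8}


def mean_active_threads(dist):
    # Weighted mean of the thread count, weights are time percentages.
    total = sum(dist.values())
    return sum(n * pct for n, pct in dist.items()) / total


means = {p: mean_active_threads(d) for p, d in TABLE3.items()}
utilization_4_programs = TABLE2[4] / 8  # fraction of the 8 thread contexts in use
```

The recomputed means are 3.54, 4.79, 5.99 and 6.79, which agree with Table 2 to within the rounding of the percentages, and 6.8 of 8 contexts corresponds to the 85% utilization quoted for 4 programs.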



An interesting note on the DSMT model is related to the cache hierarchy design. A 
study presented in [19] showed that going over 4 programs on an SMT architecture 
with a conventional memory hierarchy would not be effective since the L2-cache 
bandwidth becomes a severe bottleneck. The fact that the speculative threads are very 
likely to belong to the execution path of the running programs is a valuable 
characteristic of the DSMT model. This reduces the pressure on the cache hierarchy 
and allows the DSMT model to utilize effectively the 8 available thread contexts.

The Asymmetric model. One of the major concerns regarding SMT architectures is
that the high execution throughput is reached without direct control of the threads.
This can result in an imbalance of the execution throughput of the individual threads.
This phenomenon is magnified by priority mechanisms, such as Icount or Brcount,
introduced in [3] to enhance the overall execution rate by giving more resources to the
best performing thread. However, many users may look for low latency on one
program, even when several other programs are executing in parallel. It is unrealistic
to expect both the highest total throughput and the lowest latency of one high-priority
program at the same time, but it may be useful to find ways to reduce the performance
loss of a single high-priority program, when other lower priority threads are running
simultaneously.

To address this problem, [20] presents a scheme based on a simple fetch-stage 
prioritization. Low-priority threads can only share the fetch bandwidth that is not used 
by the high-priority thread. This scheme limits substantially the latency degradation 
of the priority thread due to SMT execution, but at the cost of a global loss of 
throughput. 
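The idea of fetch-stage prioritization, as summarized above, can be sketched as a per-cycle bandwidth allocation. This is only an illustration of the stated principle, not the mechanism of [20]: the function, the 8-instruction bandwidth, and the first-come order among low-priority threads are assumptions.

```python
def allocate_fetch(requests, high_priority, bandwidth=8):
    """Split one cycle of fetch bandwidth among threads.

    requests: dict mapping thread id -> instructions it could fetch this cycle.
    high_priority: id of the favored thread.
    bandwidth: assumed total fetch slots per cycle.
    """
    grant = {}
    # The high-priority thread is served first.
    grant[high_priority] = min(requests.get(high_priority, 0), bandwidth)
    left = bandwidth - grant[high_priority]
    # Low-priority threads only share what is left over
    # (dict order stands in for whatever policy orders them).
    for tid, want in requests.items():
        if tid == high_priority:
            continue
        grant[tid] = min(want, left)
        left -= grant[tid]
    return grant
```

When the high-priority thread can use the whole bandwidth, background threads fetch nothing in that cycle, which is consistent with the global throughput loss noted above.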

We have evaluated an asymmetric architecture (ASMT) in which one high priority 
program is favored in comparison to other background programs. In our simple 
model, the execution of one selected foreground program is supported by the 
dynamic multithreading technique, while the other lower-priority programs are 
executed in non-speculative mode using only the SMT technique. Figure 3 shows that 
the ASMT processor model exhibits only a slightly lower average performance than 
the DSMT model (except for one program, where the two models are equivalent) and 
that the average performance remains higher than the SMT model. 




[Figure 3 legend: SMT, DSMT, ASMT]



Fig. 3. Average IPC for the ASMT architecture with two 4-wide fetch ports

ASMT not only outperforms SMT in total throughput; the high priority
program also performs significantly better in ASMT than in SMT or DSMT mode. Figure
4 illustrates this advantage of asymmetric multithreading execution more precisely for
two different applications. Among the benchmarks we have simulated, vortex has the 
highest execution throughput in single thread execution mode and gcc has one of the 
lowest average IPC numbers. The metric displayed in Figure 4 is the average IPC
exhibited by each of gcc and vortex running as the high-priority program, while various
combinations of the other applications in our suite are running as background threads. 
The performance gain brought by the ASMT processor is significant and increases as 
the number of simultaneous threads increases. When 4 programs are executing, the 
ASMT model allows gcc to exhibit an average of 18% and 21% higher throughput 
than the DSMT and SMT models respectively. In the case of vortex, the performance 
advantage given by ASMT execution reaches 23% and 31% respectively. It should be 
noted that with two programs running, the ASMT foreground program still 
outperforms the SMT architecture running the same program alone. With 3 programs 
running, the degradation of the foreground program is still very small compared to the 
single thread performance on the SMT architecture. 




Fig. 4. Average IPC of a foreground application when running alone, or with 1, 2 or 3 other 
programs, for 3 different execution models with two 4-wide fetch ports. 

The ASMT processor model provides an option for having high throughput 
without sacrificing the execution latency of one particular program. A software 
controllable mode could be implemented to allow switching between DSMT and 
ASMT execution. It is important to note that the ASMT mechanism still results in a 
significant loss in the foreground program performance when several other programs 
are active. The loss is 34% for gcc and 41% for vortex when running with 3 other 
programs. We are currently investigating more aggressive priority schemes and 
instruction schedulers to reduce this loss. 



4.2 Using an 8-wide issue model

In this section, we explore alternative architectures featuring different fetch and issue 
mechanisms. The first one is an 8-wide issue machine with a single 8-wide fetch port. 
It offers the same peak fetch and issue bandwidth as the 2x4 base machine studied in 
section 4.1. The second architecture feeds instructions into an 8-wide rename unit
using two 8-wide fetch ports. It can simultaneously rename instructions from multiple
threads, if available, to fill up the rename block.



Sharing a single 8-wide fetch port. In this model, the active threads share a single
fetch port, on a round-robin basis. The thread accessing the instruction cache can
fetch up to 8 instructions in a single cycle. This model offers the same peak fetch and
issue bandwidth as the base machine, and could be equally or even more complex to
implement. First, it shares all the pipeline stages. This means the rename unit must
handle renaming for 8 instructions of the same thread simultaneously. Second, 8-wide
fetch could be quite complex, especially if multiple branch prediction and a trace 
cache [21,22] were used to maximize fetch bandwidth. 

Figure 5 shows the average throughput for the different multithreaded execution 
models that were defined in section 4.1 (SMT, DSMT, ASMT). The performance of
the SMT and DSMT models on the 2x4 base machine is also shown for comparison. 




[Figure 5: x-axis 1 Program, 2 Programs, 3 Programs, 4 Programs; legend: SMT 2x4, DSMT 2x4, SMT 1x8, DSMT 1x8, ASMT 1x8]

Fig. 5. Average IPC for various multithreaded architectures with one 8-wide fetch port. The
performance of SMT and DSMT for two 4-wide fetch ports is given as a reference.

First, we compare the performance of the 1x8 and 2x4 SMT machines. As expected, a 
boost in performance can be noticed for the SMT architecture with 1 program, in 
comparison to the base machine. The fetch and issue bandwidth available when only 
one thread is active is double the bandwidth offered by the 2x4 base machine. The 
gain is lower when running SMT with 2, 3 and 4 programs, since the 2x4 machine has 
a good chance of using both fetch ports. In this case, it is only when some threads stall 
due to instruction cache misses that it may not be possible to utilize both fetch ports. 

Second, looking closely at the SMT and DSMT performance on the 1x8 and 2x4 
machines shows a very complex behavior. There are many interesting interactions 
occurring. On one hand, having many threads active and fetching two 4-instruction
blocks every cycle helps performance by limiting instruction cache fragmentation 
effects (misalignment and taken branches). On the other hand, when there is only one 
thread fetching instructions (other threads stalling due to instruction cache misses) or 
when a thread resumes execution along the correct path after a branch misprediction, 
8-wide fetch and issue helps feed the pipeline quickly. Speculative threads complicate 
things even further. On one hand, speculative threads are subject to control flow as 
well as data mispredictions and incorrectly predicted threads waste fetch resources. 
On the other hand, speculative threads in the DMT model increase fetch efficiency by
allowing instructions to be fetched and executed out-of-order relative to mispredicted
branches. There are probably other subtle interactions that we have not yet uncovered. 

Characterizing all these interactions completely will require a lot of work, possibly
on a larger set of benchmarks. From the measurements we have done, a trend seems
to emerge. The 1x8 fetch configuration seems to favor simulations with more
programs (non-speculative threads) than speculative threads. In general, the 1x8 SMT
performs better than 2x4 SMT, but the 2x4 DSMT performs better than 1x8 DSMT
with the exception of 4-program simulations, which have more non-speculative than
speculative threads active (Table 2).

Again, the ASMT model gives better throughput than SMT, and ASMT
performance is very close to DSMT performance. However, the performance of the
foreground program is hardly enhanced compared to the SMT or DSMT model. In
Figure 6, one can see that the overall IPC of gcc or vortex is increased, but the
difference between the asymmetric execution and the regular SMT execution is very
small: in the best case 7% for 2 programs, for both gcc and vortex.




[Figure 6: two panels, GCC and VORTEX; legend: SMT 1x8, DSMT 1x8, ASMT 1x8, ASMT 2x4, ASMT 2x8] 

Fig. 6. Average IPC of a foreground application when running alone, or with 1, 2 or 3 other 
programs, for 3 different multithreaded architectures. 



In contrast to the 1x8 ASMT configuration, the 2x4 ASMT allows a significantly 
better IPC for the foreground program. The average gain in IPC brought by the two 
fetch ports over the single fetch port is 12%, 19% and 21% for gcc with 2, 3 and 4 
programs respectively, and 8%, 17% and 21% for vortex. Therefore, in order to build 
a speculative SMT microprocessor in which a good balance is maintained between the 
performance of a selected program and the overall throughput, two narrow fetch/issue 
ports should be favored over one wide port. 



Using two 8-wide fetch ports. Fetch bandwidth is a major bottleneck in 
multithreaded architectures. In this section, we look at a higher fetch bandwidth 
processor model in which the active threads share two 8-wide fetch ports. The 
processor is 8-wide issue, and in each cycle, up to 8 instructions from up to 2 threads 
can be fetched. As many instructions as possible are taken from the first thread and 
the second thread fills the remaining available issue slots. The main benefit of this 
new fetch mechanism is to limit block fragmentation while allowing high issue 
bandwidth when there is only one thread active. This front-end configuration remains 
reasonably balanced with the execution resources in the base model, whereas having a 
wider than 8-issue pipeline would require a bigger register file and more functional 
units. 
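
The slot-filling policy described above can be sketched as follows. This is only an illustration, not the simulator used in the study: the function name `fill_issue_slots`, the use of Python lists to stand for fetch blocks, and the idea that a thread's block may be shorter than 8 instructions (for example after a taken branch) are assumptions made for the example.

```python
def fill_issue_slots(first_block, second_block, width=8):
    """Merge two per-thread fetch blocks into at most `width` issue slots.

    The first thread supplies as many instructions as it can; the second
    thread only fills whatever slots remain (assumed policy, see lead-in).
    """
    from_first = first_block[:width]
    remaining = width - len(from_first)
    from_second = second_block[:remaining]
    return from_first + from_second


# Thread A's block was cut short after 5 instructions (hypothetical case),
# so thread B contributes 3 instructions to fill the 8-wide issue.
slots = fill_issue_slots(["A"] * 5, ["B"] * 8)
```

When the first thread delivers a full 8-instruction block, the second thread contributes nothing in that cycle, which corresponds to the single-active-thread case that the wide port is meant to serve.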

The performance exhibited by various multithreaded architectures featuring two 8- 
wide fetch ports is shown in Figure 7. As expected, SMT gets a major benefit from 
the new mechanism when the number of programs increases. The gain in IPC 
averages 5%, 13% and 25% for 2, 3 and 4 programs respectively compared to the two 
4-wide fetch ports model (3%, 10% and 16% when compared to the one 8-wide fetch 
port model). 




[Figure 7: bar chart; x-axis: 1 Program, 2 Programs, 3 Programs, 4 Programs; legend: SMT 2x4, DSMT 2x4, SMT 2x8, DSMT 2x8, ASMT 2x8] 

Fig. 7. Average IPC for various multithreaded architectures with two 8-wide fetch ports. The 
performance of SMT and DSMT for two 4-wide fetch ports is given as a reference. 

Again, speculative threads significantly enhance the overall throughput of the 
DSMT model for up to 3 programs. The gain averages 16%, 10% and 4% for 1, 2 
and 3 programs respectively. With 4 programs, the 4 spare contexts are not sufficient 
for efficient speculative thread execution. In general, several free contexts are needed 
to achieve significant performance gain with DMT execution [2]. This appears to be 
the case here as well for the DSMT and ASMT configurations, when multiple 
programs are running. 

The comparison of the relative performance of DSMT with 4-wide or 8-wide fetch 
ports is another result that reflects the trend pointed out earlier. The gain in going 
wider is only 2% on average for 1 or 2 programs, but 8% and 19% for 3 and 4 
programs respectively. This shows again that if we have enough resources to create 
more speculative than non-speculative threads, the contribution of the speculative 
threads to performance outweighs the benefit of having wider fetch blocks. When 
there is less opportunity for speculative thread execution (e.g. 3 and 4-program 
simulations), the benefit of having wider fetch blocks becomes significant. 

The ASMT model exhibits very good overall performance, and with 4 programs, it 
shows the highest IPC. In fact, the ASMT architecture has 4 spare contexts, when 4 
programs are running, to create speculative threads for the priority program. This is a 
sufficient number of thread contexts to significantly boost the performance of the 
high-priority foreground program. On the other hand, the high fetch bandwidth lowers 
the negative impact of the speculative threads on the other concurrent programs. Both 
observations together help explain the unexpected result that with 4 programs, the 2x8 
ASMT model provides the best total throughput of any configuration we have 
simulated, as well as the best high-priority program performance of all 4-program 
simulations, as shown in Figure 6. 



5 Summary 

We have evaluated an architecture that combines two approaches to multithreading: 
simultaneous and speculative dynamic multithreading. Simultaneous multithreading 
relies on sharing processor resources among software-generated threads or multiple 
programs to increase performance. Dynamic multithreading uses hardware 
mechanisms and data speculation to automatically create multiple threads from the 
same program. Dynamic multithreading provides the means to increase throughput on 
programs that have proven difficult to parallelize automatically with a compiler. 

The two techniques complement each other very well. When there are too few 
programs (one or a few) to take full advantage of the SMT processor, speculative DMT 
threads increase processor utilization and throughput via far-ahead execution. On the 
other hand, the speculative DMT model still falls short of utilizing the full capability 
of an SMT processor. Running one or two other programs simultaneously significantly 
increases the processor utilization. 

The most striking example of the potential of the combined technologies is the 
asymmetric model with 2 programs. The foreground thread running in DMT mode 
can provide equal to or better performance than a conventional SMT processor 
running the program alone, while maintaining a significantly higher total processor 
throughput due to the background thread running simultaneously. 

We have also presented a performance comparison of several fetch and issue 
organizations. We have found that, in general, for the same total fetch and issue 
width, a multiple fetch/issue port organization favors speculative multithreading, while 
a wider fetch and issue pipeline is better for a pure SMT model. Increasing fetch 
supply using multiple narrow fetch/issue ports on DMT processors is not only 
attractive for its simplicity, but also provides superior performance. 



Abstract. Deeply pipelined high performance processors require highly 
accurate branch prediction to drive their instruction fetch. However, there 
remains a class of events which are not easily predictable by standard 
two level predictors. One such event is loop termination. In deeply nested 
loops, loop terminations can account for a significant amount of the 
mispredictions. We propose two techniques for dealing with loop 
terminations. A simple hardware extension to existing prediction architectures 
called Loop Termination Prediction is presented, which captures the long 
regular repeating patterns of loops. In addition, a software technique 
called Branch Splitting is examined, which breaks loops with iteration 
counts above the detection of current predictors into smaller loops that 
may be effectively captured. Our results show that for many programs 
adding a small loop termination buffer can reduce the misprediction 
rate by up to an absolute difference of 2%. 



1 Introduction 

Branch prediction is the architectural feature which allows the front-end of 
the processor to continue fetching instructions in the presence of control flow 
changes. Branch prediction predicts the directions of branches during fetch, so 
the fetch engine knows which cache block to fetch from in the next cycle. When 
the wrong direction is predicted, the whole pipeline following the branch must 
be flushed, significantly impacting performance. 

Accurate branch prediction uses the past history of branches to predict the 
future behavior of a branch. Early branch prediction architectures used an N-bit 
saturating counter to predict the direction of each branch [11]. Using only an 
N-bit counter accurately predicts branches which are biased in either a taken 
or not-taken direction, but loses prediction accuracy if the branch exhibits a 
more complex pattern. To capture these patterns, local and global branch history 
prediction were proposed [16,14]. 
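
A small sketch may help make the limitation of a lone saturating counter concrete. It is an illustration under stated assumptions, not code from the cited work: the function name, the choice to start the counter in the weakly-taken state, and the two synthetic outcome streams are all made up for the example.

```python
def saturating_counter_accuracy(outcomes, n_bits=2):
    """Fraction of outcomes predicted correctly by one N-bit saturating counter."""
    top = (1 << n_bits) - 1          # highest counter value
    threshold = 1 << (n_bits - 1)    # predict taken at or above this value
    counter = threshold              # start weakly taken (an assumption)
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        correct += (predicted_taken == bool(taken))
        # Count up on taken, down on not-taken, saturating at both ends.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return correct / len(outcomes)


biased = saturating_counter_accuracy([1] * 100)          # always taken
alternating = saturating_counter_accuracy([1, 0] * 50)   # taken, not-taken, ...
```

Under these assumptions the always-taken stream is predicted perfectly, while the strictly alternating stream is right only half of the time, which is the kind of pattern that history-based predictors are intended to capture.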

Local branch history can be used to accurately predict a branch by storing 
the last L directions for a given branch, and using this to index into a 2nd level 
Pattern History Table (PHT) [16,14]. The PHT is a table of N-bit saturating 
counters used to predict the direction of the branch. Local branch history 
increases prediction accuracy by capturing arbitrary patterns that are repeated 
within the last L times the branch was executed. 

Global branch history uses correlation information between branches to 
accurately predict a branch’s outcome. A history of the last G executed branch 
directions is kept in a global history register. The global history register 
is then used to index into a pattern history table of 2-bit counters to predict 
the branch direction. 

It is hard for local and global histories to capture loop terminations of the 
pattern $((1)^N 0)^*$, where 1 represents a taken branch and 0 represents a 
fall-through branch. These patterns cannot be captured by a local history predictor if 
N is larger than L (the local history size). Loop termination can only be captured 
using global history if N is smaller than G (the global history size) or there is a 
unique branching sequence right before the loop termination. For the programs 
examined in this study, 43% of the executed branches are loop branches, and 7% of 
the executed loop branches are mispredicted on average. 
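
The effect of the local history size can be illustrated with a short simulation. This is a sketch under simplifying assumptions rather than the predictor evaluated in the paper: a single branch with its own pattern history table of 2-bit counters, all counters starting weakly taken, and the hypothetical function name `local_history_misses`.

```python
def local_history_misses(n_taken, history_bits, repeats):
    """Mispredictions per loop instance for the pattern (1^n_taken 0)^repeats."""
    pht = [2] * (1 << history_bits)      # 2-bit counters, initially weakly taken
    mask = (1 << history_bits) - 1
    history = 0                          # last `history_bits` outcomes of the branch
    misses_per_instance = []
    for _ in range(repeats):
        misses = 0
        for taken in [1] * n_taken + [0]:
            predicted = 1 if pht[history] >= 2 else 0
            misses += (predicted != taken)
            # Train the counter selected by the current local history.
            if taken:
                pht[history] = min(3, pht[history] + 1)
            else:
                pht[history] = max(0, pht[history] - 1)
            history = ((history << 1) | taken) & mask
        misses_per_instance.append(misses)
    return misses_per_instance


short_loop = local_history_misses(n_taken=9, history_bits=12, repeats=5)
long_loop = local_history_misses(n_taken=99, history_bits=12, repeats=5)
```

With a 12-bit history, the loop that is taken 9 times stops mispredicting after a brief warm-up, because the history before the exit is unique. The loop that is taken 99 times keeps missing its exit once per instance, because the history is all ones both in the middle of the loop and just before the exit.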

In this paper we propose to use a Loop Termination Buffer (LTB) to predict 
branch patterns of type $((1)^N 0)^*$. The LTB keeps track of branches with this 
behavior, and predicts when the pattern changes (terminates). This allows us to 
achieve up to 100% prediction accuracy for loop branches after a short warm-up 
period. 

In addition, we examine the potential of using a software approach called 
Branch Splitting to correctly predict loop branches. Local branch history can 
only accurately predict loop terminations for branches that execute fewer than L 
times, where L is the number of bits used for the local history. For loops that 
have an iteration count larger than L, the loop guarding branch can be split 
into two or more branches, all of which have an iteration count less than L, as 
long as the product of the new iteration counts equals the old iteration count. The 
new branches’ patterns will then fit into local history and will have all of their 
backwards and termination behavior accurately predicted. 

The rest of the paper is organized as follows. Section 2 describes Loop Termi- 
nation Prediction and our Loop Termination Buffer implementation. Section 3 
describes Branch Splitting. Simulation methodology is described in section 4. 
Section 5 evaluates the performance of Loop Termination Prediction and Branch 
Splitting. Section 6 describes prior research for loop termination prediction. Fi- 
nally, we summarize our contributions in section 7. 

2 Loop Termination Prediction 

Traditionally loops are thought of as the steady state operation of execution, 
and at first glance loop termination seems to account for only a small portion 
of the total number of branches seen. However, this is an important part of a 
branch prediction architecture, because a regular loop has a miss rate which 
is the inverse of its iteration count¹. Since the branch prediction accuracy of 
most processors is already in the high nineties, loop branch mispredictions can 
account for a large fraction of the remaining branch mispredictions. 

¹ Assuming that there are not any branches internal to the loop. 

In this section, we propose using a very simple architecture extension which 
can predict how many times a loop branch will iterate to provide loop 
termination prediction. 

2.1 Predicting Loop Termination 

To allow instruction fetch to continue without stalling, each cycle a traditional 
branch prediction architecture predicts (1) if the current instruction fetch 
contains a branch, (2) the direction of the branch, and (3) provides the branch 
target address to be used in the I-cache fetch in the next cycle. This prediction 
is typically performed given only the current instruction cache fetch address each 
cycle. 

The goal of our research is to predict when a loop branch terminates its loop- 
ing behavior. Therefore, in addition to the above prediction information, during 
branch prediction we must (4) determine that the branch is a loop branch, and 
(5) predict if it has terminated or not. In order to provide this prediction infor- 
mation, the branch has to be labeled as a loop branch in the branch prediction 
architecture and the loop’s trip count must be predicted. 

Identifying Loop Branches. Loop branches can be identified in hardware by 
either having special loop branch instructions (effectively having the compiler 
mark a branch as loop branch), or identifying the loop branches by looking at 
the sign bit of their displacement. 

For this study, we targeted loops with the conditional branch at the bottom 
of the loop as shown in the doubly nested loop code example in figure 1. The 
number of loop branches generated this way depends upon the compiler being 
used. The Compaq C and FORTRAN Alpha compilers (with -O4 optimization) 
compile most for, while, and do loops with the conditional branch check at 
the bottom of the loop with a negative displacement. 

For our programs, loop branches usually have a negative displacement, and 
we dynamically predict that a conditional branch is a loop branch if it has a 
negative displacement. While this is not a perfect definition of a loop conditional 
branch, we used it to dynamically classify in hardware which branches are loop 
branches, so that we may know which branches should attempt to use loop 
termination prediction. 

Predicting Loop Trip Count and Termination. In order to correctly pre- 
dict loop termination the branch predictor needs to (1) predict the loop trip 
count for the branch, and (2) keep track of the current loop iteration for the 
loop branch. 

The loop trip count is the number of times in a row the loop branch is taken. 
The trip count can be a constant determined at compile-time for some loops, 
constant for a given run of an application but only determined at run-time for 
that run, or dynamically changing during a program’s execution. Predicting the 
loop trip count can catch all three of these types of loop branches. 

We use a loop iteration counter to record the current iteration count, that is, the 
number of times the branch has been taken since it was last not-taken. The 
iteration counter is used to (1) predict when the loop has terminated, and (2) 
initialize the loop trip count. 

    for ( i=0; i<N; i++ ) 
    { 
        for ( j=0; j<M; j++ ) 
        { 
            do something 
        } 
    } 

Fig. 1. Doubly nested for loop 

Fig. 2. Loop Termination Buffer 

The loop termination prediction information described above could be stored 
directly into a Branch Target Buffer used for predicting target addresses during 
branch prediction [9,15]. Instead of doing this, in our study we examine having 
a small associative buffer on-the-side to accurately predict the loop termination 
for loop branches. 



2.2 Loop Termination Buffer 

The Loop Termination Buffer (LTB) is a small hardware structure capable of 
detecting the impending termination of loop-branches. The prediction mechanism 
takes advantage of the fact that many loops have trip counts that do not change 
often over the course of execution. Take for example the doubly nested loop in 
figure 1. If we assume a traditional predictor which has been properly warmed 
up, the inner loop will cause N branch misses, and the outer loop will cause 
one, for a total of N+1 misses. The inner branch has a highly regular pattern 
of being taken M-1 times and then falling through the Mth time. This last time, 
the loop’s termination, is an event which we wish to correctly predict. 

As can be seen in figure 2, the LTB has five fields: a tag field to store the 
PC of a branch; a speculative and non-speculative iteration count to store the 
number of times the branch has been taken in a row; the loop trip count field 
to track the number of consecutive times the loop-branch was taken before the 
last not-taken; and a confidence bit indicating that the same loop trip count has 
been seen at least twice in a row. 

During Branch Prediction. To access the LTB, the fetch PC, which is used 
to perform the normal branch prediction, is also used to index in parallel into 
the LTB. If there is a tag match, the speculative iteration counter is checked 
against the trip count. If they are equal, the branch is a candidate for predicting 
termination. The confidence bit is then tested. If it is set, the loop branch is 
predicted to be not-taken (exiting the loop). If instead the speculative iteration 
counter and the trip count are not equal, then the speculative iteration count is 
speculatively incremented by one. We count from 0 up with the iteration counter, 
instead of down, so that we have the iteration count available in the counter for 
initialization of the trip count. 

When Resolving a Branch. A branch is considered for insertion into the LTB 
when it completes and its branch direction is resolved. When a branch resolves 
and is found to not be in the LTB, it is inserted into the LTB if determined to be 
a loop-branch and if it is mispredicted by the default predictor. In our study, we 
assume backwards conditional branches are loop-branches. When inserted into 
the LTB, all of the entry's counters are initialized to zero. Since an entry 
is inserted into the LTB when a loop-branch is found to be mispredicted, there 
are no outstanding branches and this allows the speculative and non-speculative 
iteration counters to start out synchronized. 

During resolution, a taken branch found in the LTB has its non-speculative 
iteration counter incremented by one. 

A not-taken loop-branch found in the LTB during branch resolution updates 
its trip count and confidence bit. If the non-speculative iteration counter is equal 
to the trip count stored in the LTB, then the confidence bit is set, otherwise it is 
cleared. The non-speculative iteration count is then incremented by one and is 
copied to the trip count. The speculative iteration count is then set to the current 
speculative iteration count minus the non-speculative iteration count. The spec- 
ulative iteration count is reset to this value because the same loop-branch may 
have already been fetched again before the not-taken branch resolves. Finally, 
the non-speculative iteration counter is reset to zero. 

One reason for having two iteration counters, as shown in figure 2, is to re- 
cover the iteration counters during a branch misprediction. When a branch mis- 
prediction occurs, all of the non-speculative iteration counters copy their values 
into the speculative iteration counters. Therefore, this synchronizes the specula- 
tive and non-speculative counters. This approach is similar to prior architectures 
proposed to recover branch prediction state during a misprediction [10]. 
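
The following sketch shows the core bookkeeping of an LTB in a deliberately simplified form. It is an illustration under loudly stated assumptions, not the hardware described above: every branch is assumed to resolve before the next one is predicted, so a single iteration counter stands in for the speculative and non-speculative pair; a Python dict replaces the small associative buffer and its replacement policy; the trip count is stored as the number of consecutive takens; and the primary predictor is a stand-in that always predicts taken. All class, field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LTBEntry:
    iter_count: int = 0      # takens seen since the last not-taken
    trip_count: int = 0      # takens seen before the last not-taken
    confident: bool = False  # same trip count seen twice in a row


class LoopTerminationBuffer:
    def __init__(self):
        self.entries = {}    # tag (branch PC) -> LTBEntry; no capacity limit here

    def predicts_exit(self, pc):
        """True only when the LTB confidently predicts loop termination."""
        e = self.entries.get(pc)
        return e is not None and e.confident and e.iter_count == e.trip_count

    def resolve(self, pc, taken, is_backward, mispredicted):
        e = self.entries.get(pc)
        if e is None:
            # Insert only mispredicted loop (backward) branches, counters at zero.
            if is_backward and mispredicted:
                self.entries[pc] = LTBEntry()
            return
        if taken:
            e.iter_count += 1
        else:
            # Loop exit: update confidence, remember the trip count, restart.
            e.confident = (e.iter_count == e.trip_count)
            e.trip_count = e.iter_count
            e.iter_count = 0


def count_mispredictions(n_taken, instances, use_ltb):
    """Run a loop branch (n_taken takens, then one exit) `instances` times."""
    ltb, pc, misses = LoopTerminationBuffer(), 0x1000, 0
    for _ in range(instances):
        for taken in [True] * n_taken + [False]:
            # The LTB overrides the always-taken stand-in only on a confident exit.
            predicted_taken = not (use_ltb and ltb.predicts_exit(pc))
            mispredicted = predicted_taken != taken
            misses += mispredicted
            if use_ltb:
                ltb.resolve(pc, taken, is_backward=True, mispredicted=mispredicted)
    return misses


baseline = count_mispredictions(99, 10, use_ltb=False)
with_ltb = count_mispredictions(99, 10, use_ltb=True)
```

In this toy setting the baseline misses every loop exit, while the LTB variant misses only during warm-up: once to be inserted, and twice more before the same trip count has been observed twice in a row.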

2.3 Loop Termination Predictor 

Applying the loop termination buffer to branch prediction is a fairly straightforward 
process. When a loop branch is predicted to terminate, the branch is 
predicted as not-taken. Otherwise the branch is predicted taken. Although a 
more integrated approach of loop termination prediction into existing branch 
predictor architectures is possible (e.g., putting the counters into the branch 
target buffer), we examined using a separate buffer to concentrate on the effect 
of loop termination prediction. 

Looking at figure 3, we can see how the branch predictors are combined at a 
high level. The final predictor generates a prediction for every branch; however, 
the primary predictor will be overridden if the loop termination buffer generates 
a confident prediction. Recall from above that a confident prediction is generated 
if the branch PC is found in the LTB and it has seen the same trip count twice 
in a row. 

[Figure 3: block diagram; labels: PC, Final Prediction] 

Fig. 3. Using the Loop Termination Buffer for Branch Prediction. The primary 
predictor is used except in the case where the loop termination predictor predicts 
loop exit and is highly confident 



Original Code 

        i=0; 
    L1: i++; 
        br(i<100) L1 

Loop Split Code 

        i=0; 
        j=0; 
    L1: i++; 
        j++; 
        br(j<10) L1 
        j=0; 
        br(i<100) L1 

Fig. 4. Example of Branch Splitting. The branch in the original code will 
cause a branch misprediction 1 out of 100 times the branch is executed. The 
transformed code can fit in the local history and will correctly predict both 
loop branches 



3 Branch Splitting 



A local branch history can only accurately predict loop terminations for branches 
that execute fewer than L times. For example, let us assume that the local history 
size is 12. If the predictor is predicting a loop branch whose iteration count is 
100 (i.e. the pattern $(1^{99} 0)^*$) as in figure 4, then the local history will not be 
able to predict the loop exit. 

Because the local branch history can only accurately predict loop terminations 
for branches that execute fewer than L times, where L is the number of bits 
used for the local history, breaking the loop into multiple smaller loops will allow 
local history to correctly predict them. For loops that have an iteration count 
larger than L, the loop guarding branch can be split into two or more branches, 
all of which have an iteration count less than L, as long as the product of the new 
iteration counts equals the old iteration count. The new branches’ patterns will 
then fit into local history and will have all of their backwards and termination 
behavior accurately predicted. 

Continuing with the example above, let us split the branch into two branches, B1 
with iteration count of 10, and B2 with iteration count of 10. This will create a 
doubly nested loop, whose loop body will be executed the same number of times 
as the original loop. Both the branch patterns will now be $(1^{9} 0)^*$, which can be 
easily captured in the local history. This is very much akin to the technique of 
loop tiling, except here the resource we are tiling for is the branch predictor's 
local history. 
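
The transformation in figure 4 can be checked with a small sketch that executes both versions of the loop and records the outcome stream of each branch. The function names and the representation of outcomes as lists of 1 (taken) and 0 (not-taken) are assumptions made for the example; the loop bounds of 100 and 10 follow the figure.

```python
def original_loop(limit=100):
    """Outcome stream of the single loop branch: i++; br(i < limit)."""
    outcomes, body_runs, i = [], 0, 0
    while True:
        i += 1
        body_runs += 1
        taken = i < limit
        outcomes.append(int(taken))
        if not taken:
            return outcomes, body_runs


def split_loop(inner=10, limit=100):
    """Outcome streams of the two branches after splitting (B1 inner, B2 outer)."""
    b1, b2, body_runs, i, j = [], [], 0, 0, 0
    while True:
        i += 1
        j += 1
        body_runs += 1
        b1_taken = j < inner          # br(j<10) L1
        b1.append(int(b1_taken))
        if b1_taken:
            continue
        j = 0
        b2_taken = i < limit          # br(i<100) L1
        b2.append(int(b2_taken))
        if not b2_taken:
            return b1, b2, body_runs


orig_outcomes, orig_runs = original_loop()
b1_outcomes, b2_outcomes, split_runs = split_loop()
```

Both versions run the loop body 100 times. The original branch produces 99 takens followed by one fall-through, whereas each split branch only ever produces runs of 9 takens followed by one fall-through, a pattern short enough for a 12-bit local history.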

To apply branch splitting, we need accurate knowledge of the loop bounds, 
in order to create new bounds with little or no cleanup code. If cleanup code 
is generated, then this may more than offset the branch mispredictions saved 
by the main loop. In addition, when applying branch splitting one needs to 
be aware of the increased pressure splitting the loop branch will create on the 
branch predictor, since it will increase the number of entries used. 

Fig. 5. Percentage of conditional branches executed which are looping 
branches, as detected by being backwards, or as detected by hardware or 
software 

Fig. 6. Confident Prediction Accuracy. The prediction accuracy of the LTB 
when it returns its prediction as confident 



4 Methodology 

To perform our evaluation we gathered results for 13 of the SPEC95 INT and FP 
benchmarks. The benchmarks were compiled on an Alpha AXP-21164 processor 
under OSF/1 V4.0 using the DEC C and FORTRAN compilers. All benchmarks 
were compiled at full optimization (-O4 -ifo). Since we are using full 
optimization, some loops in the programs were unrolled and software pipelined. 

For the results in this paper we used ATOM [12] to instrument the programs 
and gather simulation results. The ATOM instrumentation tool has an interface 
that allows the elements of the program executable, such as instructions, basic 
blocks, and procedures, to be queried and manipulated. In particular, ATOM 
allows an “instrumentation” program to navigate through the basic blocks of 
a program executable, and collect information about registers used, opcodes, 
branch conditions, and perform control-flow and data-flow analysis. Programs 
were executed for a total of five billion instructions or until program termination. 



5 Results 

To evaluate the benefit of loop termination prediction we examine the branch 
prediction performance of two predictors based on McFarling’s meta predic- 
tor [8]. 

Table 1. Percent of executed loop branches that had an average trip count 
shown in each column header. For example, the results for apsi show that loop 
branches, which had a loop count between 40 and 69 iterations, accounted for 
69% of the executed loop branches in the program 

    …                                                   0.00%    0.00%    0.00%    0.00% 
    vortex   82.61%   0.00%   1.97%   3.69%   1.64%    7.98%    1.24%    0.87%    0.00% 
    wave5     3.19%   0.00%   0.00%   2.63%   0.00%   32.96%    6.56%   54.66%    0.00% 

McFarling’s meta predictor has a meta chooser table of 2-bit counters to 
choose between bimodal and gshare branch prediction. We use a bimodal table of 
2-bit counters indexed by the PC to produce the bimodal prediction. In addition, 
we use a global history register XORed with the branch PC as an index into a 
table of 2-bit counters to provide the gshare prediction. The meta table is also 
indexed with the PC, and the 2-bit counter keeps track of which predictor (the 
bimodal predictor or gshare predictor) is more often correctly predicting the 
branch, and then the corresponding prediction is used. 
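
The structure just described can be sketched as follows. This is a minimal illustration under assumptions, not the simulated predictor: the table size, the initial counter values, the rule of training the chooser only when the two components disagree, and all class and method names are choices made for the example.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class MetaPredictor:
    def __init__(self, index_bits=10):
        size = 1 << index_bits
        self.mask = size - 1
        self.bimodal = [2] * size   # indexed by PC
        self.gshare = [2] * size    # indexed by PC XOR global history
        self.chooser = [2] * size   # indexed by PC; >= 2 selects gshare
        self.ghr = 0                # global history register

    def _indices(self, pc):
        return pc & self.mask, (pc ^ self.ghr) & self.mask

    def predict(self, pc):
        bi, gi = self._indices(pc)
        use_gshare = self.chooser[bi] >= 2
        return (self.gshare[gi] if use_gshare else self.bimodal[bi]) >= 2

    def update(self, pc, taken):
        bi, gi = self._indices(pc)
        bimodal_ok = (self.bimodal[bi] >= 2) == taken
        gshare_ok = (self.gshare[gi] >= 2) == taken
        if bimodal_ok != gshare_ok:
            # Train the chooser toward whichever component was right.
            self.chooser[bi] = bump(self.chooser[bi], gshare_ok)
        self.bimodal[bi] = bump(self.bimodal[bi], taken)
        self.gshare[gi] = bump(self.gshare[gi], taken)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


predictor = MetaPredictor()
results = []
for taken in [True, False] * 50:       # a strictly alternating branch
    results.append(predictor.predict(0x4000) == taken)
    predictor.update(0x4000, taken)
```

On this alternating branch the bimodal component keeps predicting taken, while the gshare component learns the pattern once the global history has filled, and the chooser then settles on gshare. A long loop branch would still be mispredicted at its exit by both components, which is the gap the LTB is meant to close.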

The predictor with local history, called here the local/global chooser or LGC, 
is very similar to the McFarling predictor with the exception that the bimodal 
predictor is replaced with two tables for local history prediction. A per branch 
history is tracked in a first level local history table, which is then used to index 
into a second table of 2-bit counters. This extra level of indirection allows local 
branch history patterns to be captured. The major difference between the LGC 
and the branch predictor found on the Alpha 21264 [7] is that LGC uses the PC 
to index into the chooser, as opposed to using the global history. 



5.1 Loop Characterization 

We begin our analysis with a characterization of the loop behavior in the pro- 
grams we examined. Figure 5 shows the percent of conditional branches exe- 
cuted which we classified as loop branches. These are executed branches having 
a negative sign displacement as described in section 2.1. The graph also shows 
the fraction of these executed branches that were found in our loop termina- 
tion buffer (Hardware Loop Detection) during execution. Then layered on top 
of that is the percent of executed branches that had branch splitting applied 
to them. For swim, almost 100% of its executed branches were loop branches. 
These were all found in the LTB, and had branch splitting applied to them. For 
applu, 63% of its executed branches were classified as loop branches, 34% of the 
executed branches were found in the LTB, and 9% of the executed branches had 
the branch splitting optimization applied to them. 

We further examine these loop branches by breaking them down in terms of 
their iteration count. Table 1 shows the breakdown of loop branches in terms 
of the iteration count represented by the range in the column headers. These 
numbers are calculated by taking the average trip count for each static branch, 
multiplying this by the number of times the branch was executed, and then 
dividing this by the total number of loop branches executed. For example, the 
results for mgrid show that 88% of the loop branches executed were executed in 
a loop which went from 0 to a loop trip count somewhere between 40 and 69. 

In looking at the results and source code for compress, we saw that 97% 
of the loop branches that were executed were in a loop which iterated between 
0 and 7. This is the reason for the dominating short trip counts in compress. 
The Compaq compiler did not unroll this loop, perhaps due to a rarely taken 
break statement inside the loop. For this program, local (per-branch) history 
will correctly predict the loop branches, because the trip count is less than the 
local history size (see figure 7). But for many of the other programs, local branch 
history cannot correctly capture all of the loop terminations. The local history 
length is not sufficient to capture a trip count in the range of 40 to 69 iterations 
as in mgrid. 

5.2 Branch Splitting 

To examine the potential of Branch Splitting as presented in section 3, we 
profile each branch over the execution of the application. We track the number of 
mispredictions from each branch, along with information on the number of 
iterations between loop exits, and the regularity of the pattern found. A backwards 
branch is said to be regular if the branch direction pattern is $((1)^N 0)^*$ over the 
entire execution of the program, so the iteration count was always N + 1. If the 
branch is regular and the iteration count is larger than the local history size, then 
we applied branch splitting to the loop branch. The light gray part of the bar in 
figure 5 shows the percent of executed branches to which splitting was applied. 

Table 2 shows the potential reduction in branch mispredictions from applying 
branch splitting. The first two columns are the before and after overall branch 
misprediction rates, and the third column is the difference between them. The 
next column shows the total number of static branches in the executable. The 
last column is the percent increase in static branches when branch splitting was 
used. While the number of dynamic branches should not change significantly, any 
extra static branches can lead to more collisions in the table. Some applications 
such as apsi do quite well, achieving an absolute decrease in misprediction rate 
of 1.3%. 



5.3 Loop Termination Prediction 

In this section, we examine the reduction in branch misprediction rate from 
adding our Loop Termination Buffer to the Meta predictor, represented by Meta 
and Meta+LTP, and to the LGC predictor, represented by LGC and LGC+LTP. For 
the results gathered, each branch prediction table (Meta, Local, Bimodal) has 
32K entries per table. For the loop termination predictor, we simulate adding 
a small 32 entry fully associative LTB with random replacement to the base 
predictor. The size of this predictor (for both the tag and data) is less than 256 
bytes. 

Table 2. Effects of Branch Splitting. The first two columns are the before and 
after overall branch misprediction rates, and the third is the difference between 
them. Static branches is the total number of branches in the executable. The 
last column is the percent increase in static branches when branch splitting is 
used 

    app        miss rate   after splits   difference   static branches   % branch inc 
    applu        2.79%        2.52%          0.27%           1153           4.08% 
    apsi         2.64%        1.33%          1.31%           1642           4.45% 
    compress    10.96%       10.96%          0.00%            182           1.65% 
    gcc          9.07%        9.03%          0.04%          15952           0.42% 
    hydro2d      0.33%        0.09%          0.24%           1667           7.26% 
    m88ksim      4.67%        4.19%          0.48%           1017           0.10% 
    mgrid        1.64%        1.61%          0.03%           1172           0.34% 
    su2cor       4.15%        3.83%          0.32%           1717           3.38% 
    swim         0.20%        0.00%          0.20%            983           3.26% 
    tomcatv      0.99%        0.63%          0.37%            891           3.03% 
    turb3d       2.98%        2.65%          0.32%           1274           1.18% 
    vortex       1.69%        1.69%          0.00%           6259           0.18% 
    wave5        0.74%        0.30%          0.44%           1794           2.84% 

The confidence prediction accuracy of LTP is shown in figure 6, and is 
independent of the base predictor used. This is the percent of loop branches that 
were predicted to be accurate by the confidence counter in the LTB. Recall from 
section 2 that the confidence bit is set when the branch PC is found in the 
table and it has seen the same iteration count two times in a row. The reason that 
the prediction accuracy is not near 100% for all applications is that some loops 
have break statements, and/or they change the number of times they iterate 
during the execution of the program. 
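
The confidence rule described above can be illustrated with a small software 
model. This is only a sketch under stated assumptions, not the hardware design: 
the names LoopTerminationBuffer, LTBEntry, trip_count and current_iter are 
hypothetical, the 32 entry capacity and random replacement are not modeled, and 
recovery of speculative counts after a misprediction is left out. Clearing the 
confidence bit when a loop runs past its learned trip count is also an 
assumption of this sketch. 

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LTBEntry:
    """One buffer entry; field names are illustrative, not from the paper."""
    trip_count: Optional[int] = None  # taken outcomes seen in the previous run
    current_iter: int = 0             # taken outcomes seen in the current run
    confident: bool = False           # same trip count seen twice in a row


class LoopTerminationBuffer:
    def __init__(self):
        # Branch PC -> entry. Capacity and replacement are not modeled here.
        self.entries = {}

    def predict(self, pc):
        """Return True (taken), False (loop exit), or None to defer to the
        base predictor when the entry is missing or not confident."""
        entry = self.entries.get(pc)
        if entry is None or not entry.confident:
            return None
        return entry.current_iter < entry.trip_count

    def update(self, pc, taken):
        """Train the entry with the resolved outcome of the loop branch."""
        entry = self.entries.setdefault(pc, LTBEntry())
        if taken:
            # Assumption: running past the learned count drops confidence.
            if entry.confident and entry.current_iter >= entry.trip_count:
                entry.confident = False
            entry.current_iter += 1
            return
        # Loop exit: confident only if this run matched the previous run.
        entry.confident = entry.current_iter == entry.trip_count
        entry.trip_count = entry.current_iter
        entry.current_iter = 0
```

A loop with a break statement, or one whose trip count changes between runs, 
fails the equality test at loop exit, which is why the confidence accuracy 
reported here is below 100% for some applications. 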

We next examine the effects of LTP on the prediction accuracy of loop 
branches and all executed branches. Figure 7 shows the branch misprediction 
rates for only the loop branches, and figure 8 shows the misprediction rate in 
terms of all executed branches. Results are shown using 32K entry tables for 
the base Meta and LGC predictors, and when adding our LTB to the predictor. 
Compress, as discussed above, has a small inner loop that contains a break 
statement and is not unrolled and pipelined by the Compaq C compiler. This 
loop escapes the Meta prediction but is fully captured in the LTB. It is also 
captured by the LGC predictor in its local history. The results in figure 8 show that 
we reduce the overall miss rate for mgrid from 1.7% down to 0.1% using loop 
termination prediction. Figure 9 shows the code example of one of the routines 







Fig. 7. Loop Misprediction Rate. The prediction accuracy of McFarling’s Meta 
predictor and a local/global chooser with 32k entry tables, with and without 
loop termination prediction for only loop branches 




Fig. 8. Overall Misprediction Rate. The prediction accuracy of Meta and LGC 
with 32k entry tables, with and without loop termination prediction for all 
branches 

      SUBROUTINE RESID(U,V,R,N,A) 
      INTEGER N 
      REAL*8 U(N,N,N),V(N,N,N),R(N,N,N),A(0:3) 
      INTEGER I3, I2, I1 
      DO 600 I3=2,N-1 
      DO 600 I2=2,N-1 
      DO 600 I1=2,N-1 
 600  R(I1,I2,I3)=V(I1,I2,I3) 
     >      -A(0)*( U(I1,I2,I3) ) 
     >      ... additional matrix computations left out of example 
      CALL COMM3(R,N) 
      RETURN 
      END 



Fig. 9. Looping code example from mgrid, which accounts for 22% of the 
mispredicted branches 

Table 3. Average number of instructions between a mispredicted branch for 
each of the predictors and table sizes simulated 

name        32k Meta   32k Meta+LTP   32k LGC   32k LGC+LTP 
applu          1969          1310        1898          2617 
apsi           3394          5863        1755          2745 
compress         54            63          85            85 
gcc             147           151         105           107 
hydro2d        5206         26277        5071         23871 
m88ksim         311           430         414           650 
mgrid          4370        197808        4320        215815 
su2cor          333           363         331           360 
swim          25002       2721829       24994       2635741 
tomcatv        3731          9356        3691          9111 
turb3d         1508          1532        2102          2244 
vortex         1562          1599        1306          1397 
wave5          6844         16752       10421         45375 


in mgrid which accounted for 22% of the branch mispredictions. Almost all of 
these were eliminated using loop termination prediction. 

Table 3 shows the average number of instructions between mispredicted 
branches with and without loop termination prediction for each of the 3 simu- 
lated systems. The results show a large increase in additional ILP exposed by 
eliminating the mispredictions due to loop termination branches. 



5.4 Compiler Interaction 

Our loop termination prediction architecture is a highly accurate structure that 
rarely makes incorrect predictions. However, one fairly common loop type will 
cause problems: loops with break or continue statements. For example, a loop 
with a break statement in it may have the same trip count several times in a row, 
and then terminate abruptly. While these types of loops are still predicted with 
higher accuracy than in a traditional predictor, they prevent perfect prediction, 
even with confidence. Table 2 shows the results of eliminating mispredictions 
(using branch splitting) only for loop branches that always have the same 
iteration count. Figure 8 shows that we achieve a much higher 
reduction in misprediction rate using hardware loop termination prediction. This 
is because the LTB accurately predicts loops whose iteration counts change 
over the lifetime of the program. 

A common compiler optimization to improve memory performance is to tile 
loops. This optimization creates smaller inner loops to traverse the data, 
keeping as much data as possible in the cache, which creates reuse and reduces 
cache misses. Tiling creates much more loop termination behavior, which could be 
correctly predicted by our loop termination buffer. In addition, our branch 
splitting results indicate that it may be worthwhile to also take into 
consideration the size of the branch history registers for branch performance, 
when tiling a loop for memory performance. For the results gathered in this 
paper, the compiler did not perform any tiling. 



6 Related Work 

Branch prediction research has concentrated on reducing aliasing in branch 
prediction tables, and on increasing accuracy by using chooser/meta predictors, 
which allow branches to be predicted correctly by either local or global history, 
based upon confidence information [1]. 

6.1 Other Branch Predictors for Predicting Loop Termination 

Recently, researchers have examined using prior committed values [4] or value 
prediction in combination with traditional branch prediction to improve branch 
prediction accuracy [2,3]. These predictors can accurately predict loop 
terminations for long loop histories, since they predict the branches based on past 
or predicted values. They are more general and can potentially capture more 
data correlation for branches besides loop termination, although at the cost of 
adding large buffers to store values or value differences. Our loop termination 
buffer approach accurately predicts loops even with just the addition of a small 
32 entry prediction buffer (less than 256 bytes in size). In addition, the 
value-based approaches differ fundamentally from the LTB in how they predict 
loop terminations. While the value-based approaches predict the input operands 
(or their difference), the LTB predicts loop trip counts and keeps track of the 
current speculative loop iteration of the branch. The benefit of our LTB approach 
is its simplicity and low cost: it can easily be added to an existing predictor 
with very little overhead or increase in cycle time. 

6.2 Predicting Loop Iterations for Speculative Thread Generation 

Knowing the speculative loop iteration is important for speculative threaded 
execution. Tubella and Gonzalez [13] examined adding a Loop 
Execution Table (LET) and Loop Iteration Table (LIT) to a speculative threaded 
processor, to provide prediction information for loop iterations. The LET is used 
to predict the number of iterations for each loop to guide the generation of 
speculative threads, whereas the LIT is used to keep track of information related to 
the live-in registers and memory locations from the last loop iteration. 

Our loop termination buffer is very similar to the function of the loop execu- 
tion table. Since their goal was to guide speculative thread generation, they did 
not examine using the LET for branch prediction. Our LTB has a confidence 
predictor, which is not in the LET, to determine when to use the loop termi- 
nation prediction or the original branch predictor. In addition, the LTB keeps 
track of a speculative and non-speculative loop count, which is used to restore 
the loop count in the LTB in case of a misprediction. 

6.3 Loop Counting Branch Instructions 

Some architectures have special instructions for loop branches. The IBM 
PowerPC [5] has a loop-based branch instruction that branches based on the value of 
a Count Register. The Intel IA-64 [6] has an identical branch called the Counted 
Loop Branch. For both of these architectures, when a loop-branch instruction is 
executed, if the count register is non-zero, then the register is decremented and 
the branch is taken. Otherwise, the branch is not taken (the loop is terminated). 

The PowerPC and IA-64 architectures currently do not use this loop-based 
branch instruction for LTP. In order to use it for loop termination prediction, 
an architecture would need to speculatively keep track of the loop count value. 
Each time the loop-branch is predicted, the speculative loop count would then be 
decremented. When the speculative count value reaches zero, the branch would 
be predicted as terminated. In actuality, a stack of speculative count values 
would be needed, so that the count register could be saved between procedure 
calls. This is the case if the loop-branch register can be saved in order to allow 
different nested procedures to use it. 
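
The speculative bookkeeping outlined above can be sketched as follows. This is 
a hypothetical illustration, not a description of any PowerPC or IA-64 
implementation: the class SpeculativeLoopCounter and its methods are invented 
names, and the branch semantics follow the wording of this section (test for 
non-zero, then decrement and take the branch). 

```python
def counted_loop_branch(count):
    """One execution of a loop-count branch as described in the text:
    if the count is non-zero, decrement it and take the branch;
    otherwise fall through (the loop terminates)."""
    if count != 0:
        return True, count - 1
    return False, count


class SpeculativeLoopCounter:
    """Front-end copy of the count register (hypothetical structure)."""

    def __init__(self):
        self.spec_count = 0
        self.saved = []  # stack of counts saved across procedure calls

    def set_count(self, value):
        # Models the instruction that loads the count register.
        self.spec_count = value

    def predict(self):
        # Predict the next loop branch and advance the speculative count.
        taken, self.spec_count = counted_loop_branch(self.spec_count)
        return taken

    def call(self):
        # Save the caller's speculative count on procedure entry.
        self.saved.append(self.spec_count)

    def ret(self):
        # Restore the caller's speculative count on procedure return.
        self.spec_count = self.saved.pop()
```

With a count of 3 the sketch predicts taken three times and then predicts the 
loop exit. Recovery of the speculative count after a misprediction is not 
modeled. 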

A stack-based speculative loop count predictor would be very similar to the 
Loop Termination Buffer we propose. Even so, the Loop Termination Buffer 
would be preferred, since it can capture more types of branches than those that 
are implemented using the loop-based branch instruction, and we can achieve a 
high degree of loop branch coverage with an LTB of size less than 256 bytes. 

7 Summary 

Loop terminations can constitute a large percentage of the mispredictions gen- 
erated by current branch predictors. In this paper we proposed two schemes 
for dealing with this problem, the Loop Termination Buffer (LTB) and Branch 
Splitting. 

The LTB is a hardware mechanism that detects and predicts branch patterns 
of the form ((1)^n 0)^m, that is, a run of taken outcomes ended by a single 
not-taken outcome, repeated. The LTB tracks branches with this behavior and then 
informs the predictor when it has found such a pattern. The addition of the 
LTB allows regular loop guarding branches to be predicted with 100% accuracy. 
However, we also find a significant benefit from the LTB for branches that have a 
loop trip count that changes over the lifetime of the program. The LTB reduced 
the loop mispredictions from 5.4% down to 0.2% for m88ksim, and aided the 
overall prediction accuracy by a significant amount for most programs. 

In addition, we examined the potential of using a software approach called 
Branch Splitting to correctly predict loop branches. For loops that have iteration 
counts larger than may be captured by local history, the loop guarding branch 
may be split into two or more branches, all of which have an iteration count 
that would be captured by local history. This approach resulted in only small 
reductions in misprediction rates for some programs, since we only applied it to 
loops that had the same iteration count for the execution of the whole program. 
For apsi, branch splitting decreased the misprediction rate from 2.6% down 
to 1.3%. 

Abstract. The performance of a traditional cache memory hierarchy 
can be improved by utilizing mechanisms such as a victim cache or a 
stream buffer (cache assists). The amount of on-chip memory for cache 
assist is typically limited for technological reasons. In addition, the cache 
assist size is limited in order to maintain a fast access time. Performance 
gains from using a stream buffer or a victim cache, or a combination of 
the two, vary from program to program as well as within a program. 
Therefore, given a limited amount of cache assist memory, there is a need 
and a potential for “adaptivity” of the cache assists, i.e., an ability to vary 
their relative size within the bounds of the cache assist memory size. We 
propose and study a compiler-driven adaptive cache assist organization 
and its effect on system performance. Several adaptivity mechanisms are 
proposed and investigated. The results show that a cache assist that is 
adaptive at loop level clearly improves the cache memory performance, 
has low overhead, and can be easily implemented. 



1 Introduction 

The area available for on-chip caches is limited and the size and associativity of 
a cache for a given processor cannot be significantly increased without causing 
an increase in the cycle time. A small area dedicated to a victim cache and/or 
a stream buffer [7] can increase the performance of the memory system while 
it may not be large enough to double the cache size. Victim caches eliminate 
conflicts and exploit temporal locality of the programs, while stream buffers 
exploit spatial locality because they fetch data that is likely to be accessed in 
the near future. We call a victim cache, a stream buffer or a combination of the 
two a cache assist. 

A cache assist needs to have a high degree of associativity, and it needs to 
have an access time equal to that of the level of cache utilizing it, i.e. its access 
time is very small. This imposes a limit on the size of the cache assist memory. 
In [8] it is shown that for any CMOS process technology the cache size cannot be 
increased too much without causing an increase in cycle time and access time. 

When both a victim cache and a stream buffer are desirable, their relative sizes 
have to be selected within the bounds of the (small) cache assist memory size. 

* This work was supported in part by the DARPA ITO under Grant DABT63-98-C-0045. 

Unfortunately, neither a victim cache nor a stream buffer is a panacea: in 
some programs a victim cache performs much better, in others a stream buffer 
performs much better. In this paper we show that a dynamic combination of the 
two improves the overall performance the most. This happens across different 
applications as well as within a single application. 

We propose a simple system that allows the cache assist configuration to vary 
at run time. A set of four special instructions is used to change the functioning 
of the cache assist, making it work as a stream buffer, victim cache or a combi- 
nation of the two. A compiler can insert these instructions in the code at points 
it determines suitable by either static code analysis or using profile-directed 
feedback. 

While the hardware modifications are modest, the following questions deter- 
mining the feasibility of the approach need to be answered: 

1. when should the cache assist configuration be changed, 

2. how often is it necessary to reconfigure, 

3. what is the optimal reconfiguration policy? 

On one hand it would not be feasible to change the cache assist configura- 
tion every few instructions as the overhead associated with such reconfiguration 
would make the approach prohibitively expensive. On the other hand if we re- 
configure too infrequently, e.g. once per function call, we might miss some opti- 
mization opportunities because a function may contain a number of loops, each 
of them with a distinct cache behavior. 

It has been shown that the majority of dynamic instructions in a program 
are executed in innermost loops. An inner loop is also likely to have reasonably 
stable spatial/temporal locality characteristics. This suggests that an inner loop 
may be a good place to change the organization of the cache assists and maintain 
the setting for the duration of such a loop. In this paper we propose and study 
different schemes of adapting the cache assist at loop level, trying to determine 
which one has better performance. We also propose other schemes using a more 
static assist memory partitioning and compare their performance with the loop- 
level adaptive cache assist configurations. 

We currently use a profile-based mechanism for the control of adaptation 
by the compiler. Future work will study the opportunity to use compile-time 
analysis for making adaptivity decisions. The size of the cache assist memory 
is very important for both the access time and the effectiveness of adaptivity. 
The effect of varying the cache assist size on the miss rate of the memory system 
is studied as well. 

2 Related Work 

A victim cache [7] is a mechanism aimed specifically at conflict misses. It 
predicts that a replaced line of data will be accessed again shortly and stores the 
replaced data in a small fully-associative buffer on the refill path of the cache. 
On a cache miss, the victim cache is checked to see whether the data is present. 
If so, the data is copied from the victim cache to the cache. 
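
The lookup sequence just described can be modeled in a few lines. The sketch 
below is illustrative only: the class name DirectMappedWithVictim is 
hypothetical, addresses are already reduced to cache line numbers, and two 
details are assumptions of the sketch rather than statements from the text, 
namely that a victim hit swaps the two lines and that the victim buffer evicts 
its oldest entry when full. 

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped cache backed by a small fully-associative victim buffer."""

    def __init__(self, num_sets, victim_lines):
        self.sets = [None] * num_sets   # one resident line number per set
        self.victim = OrderedDict()     # line number -> True, oldest first
        self.victim_lines = victim_lines

    def access(self, line_addr):
        """Return 'hit', 'victim_hit' or 'miss' for one line access."""
        index = line_addr % len(self.sets)
        resident = self.sets[index]
        if resident == line_addr:
            return "hit"
        # Cache miss: check whether the victim buffer holds the line.
        in_victim = self.victim.pop(line_addr, None) is not None
        # The displaced line (if any) is stored in the victim buffer.
        if resident is not None:
            self.victim[resident] = True
            if len(self.victim) > self.victim_lines:
                self.victim.popitem(last=False)  # drop the oldest entry
        self.sets[index] = line_addr
        return "victim_hit" if in_victim else "miss"
```

Two lines that map to the same set, accessed alternately, miss on every access 
in the plain direct-mapped cache but hit in the victim buffer once both have 
been touched, which is the conflict-miss case the mechanism targets. 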

A stream buffer [7] is a mechanism to prefetch and store data. It consists of 
a FIFO memory plus an address generator. On a cache miss, all stream buffers 
are searched in parallel to find whether the data is present. On a hit, the data 
is copied to the cache and the stream buffer is refilled from successive addresses 
in the lower memory hierarchy. On a stream buffer miss, a buffer is allocated and 
addresses following the miss address will be prefetched into the buffer. 

To the best of our knowledge there is no previous work in applying adaptivity 
to configure a cache assist memory. However, adaptivity has been applied in 
various forms. Selected examples of its use are: 

Adaptive routing pioneered by ARPANET in computer networks and, more 
recently, applied to multiprocessor interconnection networks [1], [3] to avoid con- 
gestion and route messages faster to their destination. 

Adaptive throttling for interconnection networks [3]. [16] shows that the 
“optimal” limit varies and suggests admitting messages into the network adaptively 
based on current network behavior. 

Adaptive cache control of coherence protocol choice was proposed and 
investigated in the FLASH and JUMP-1 projects [4], [11]. 

Adapting branch history length in branch predictors was proposed in [9] since 
optimal history length was shown to vary significantly among programs. 

Adaptive page size has been proposed in [14] to reduce the page management 
overhead, and it is used to reduce the TLB and memory overhead in [12]. 

Adaptive adjustment of data prefetch length in hardware was shown to be 
advantageous [2], while in [5] the prefetch lookahead distance was adjusted 
dynamically either purely in hardware or with compiler assistance. A cache with a 
fixed large cache line is used in [10] in association with a predictor to only fetch 
the parts of the cache line that are likely to be used. 

Adaptive cache line size was shown to improve the miss rate without an 
appreciable increase in bandwidth in [18], [19] and [6]. A scheme for adapting the 
cache line size dynamically was proposed in [18]. A special adaptive controller is 
incorporated in the cache access controller to monitor the memory access pattern 
of an application and change the line size to double or half its original size at 
a time in order to suit the application’s needs. In [18] the cache line is truly 
variable, whereas [19] uses a set of four predefined values for the line size. A 
scheme that uses two fixed sizes was proposed in [6]. 

A method to use compiler provided information to do software assistance for 
data caches was proposed in [15]. The compiler decides through static analysis 
when data exhibits spatial or temporal locality and generates code to attach a 
special spatial/temporal tag. The tag is used by the hardware when deciding if 
cache lines replaced from the cache should be placed in a victim cache. 

3 System Organization 

Figure 1 shows the components of the system being studied. It consists of a 
3-level memory hierarchy plus a partitionable cache assist memory that can function 
as either a stream buffer, a victim cache or a combination of the two. The cache 
assist memory consists of N cache-line sized buffers connected to the L1 fill path. 
Separate control units utilize the allocated memory as a victim cache or as a 
stream buffer. A fully associative write buffer with a line size identical to the L1 
line size is also used. 

The L1 cache is direct mapped and the hit latency is assumed to be 1 cycle. 
The L1 bus transfer takes 2 cycles. L2 is 2-way set-associative with an access 
latency of 15 cycles. The main memory access latency is 100 cycles. 

When the processor requests data, the L1 cache is searched. On a miss, the 
victim cache and the stream buffer are searched in parallel. If both miss, the 
request is sent to the next level of memory; otherwise the cache assist supplies 
the data. 

Associated with the cache assist area are configuration registers. The regis- 
ters contain the size of the victim cache, the size of the stream buffer, and hit 
counters for both of them. The configuration for the cache assist can be changed 
dynamically at run time using four operations: 

— shrink_stream_buffer(cache_lines_to_shrink) 

— shrink_victim_cache(cache_lines_to_shrink) 

— extend_stream_buffer(cache_lines_to_enlarge) 

— extend_victim_cache(cache_lines_to_enlarge). 

Extending the stream buffer marks the new entries as invalid; shrinking it 
does the same and deletes any pending requests from the “issued prefetch” queue. 
Shrinking and extending the victim cache sets the victim cache size register to 
the new value and, in the case of extending, marks the added entries as invalid. 

The compiler can insert these instructions in places in the program where 
static analysis or profile-based feedback determines that changing the configuration 
and relative sizes of the cache assists will improve the performance. 
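
The configuration registers and the four operations can be pictured with a 
small software model. This is a sketch under stated assumptions, not the 
hardware interface: the class name CacheAssistConfig and the bounds checks are 
illustrative additions, and only the size registers and hit counters named in 
the text are modeled (entry invalidation and the prefetch queue are omitted). 

```python
class CacheAssistConfig:
    """Size registers and hit counters of a partitionable cache assist."""

    def __init__(self, total_lines, victim_lines=0, stream_lines=0):
        if (victim_lines < 0 or stream_lines < 0
                or victim_lines + stream_lines > total_lines):
            raise ValueError("partition does not fit in the assist memory")
        self.total_lines = total_lines    # N cache-line sized buffers
        self.victim_lines = victim_lines  # victim cache size register
        self.stream_lines = stream_lines  # stream buffer size register
        self.victim_hits = 0              # hit counters kept per assist
        self.stream_hits = 0

    def _free_lines(self):
        return self.total_lines - self.victim_lines - self.stream_lines

    def shrink_stream_buffer(self, cache_lines_to_shrink):
        if not 0 <= cache_lines_to_shrink <= self.stream_lines:
            raise ValueError("cannot shrink stream buffer by that amount")
        self.stream_lines -= cache_lines_to_shrink

    def shrink_victim_cache(self, cache_lines_to_shrink):
        if not 0 <= cache_lines_to_shrink <= self.victim_lines:
            raise ValueError("cannot shrink victim cache by that amount")
        self.victim_lines -= cache_lines_to_shrink

    def extend_stream_buffer(self, cache_lines_to_enlarge):
        if not 0 <= cache_lines_to_enlarge <= self._free_lines():
            raise ValueError("not enough free assist lines")
        self.stream_lines += cache_lines_to_enlarge

    def extend_victim_cache(self, cache_lines_to_enlarge):
        if not 0 <= cache_lines_to_enlarge <= self._free_lines():
            raise ValueError("not enough free assist lines")
        self.victim_lines += cache_lines_to_enlarge
```

For example, a 1KB assist with 32 byte lines has 32 buffers. Turning an 
all-victim configuration into an even split would take a 
shrink_victim_cache(16) followed by an extend_stream_buffer(16). 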

4 Experimental Infrastructure 

4.1 Simulator 

The framework provided by the ABSS [13] simulation system is used in this 
study. ABSS is a simulator that runs on SUN Sparc systems and is derived from 
the MINT simulator [17]. 

The ABSS simulator consists of 5 parts: augmentor, thread management, 
cycle-counting libraries, user-defined simulator of the memory system and the 
application program. 

The augmentor program (called doctor) parses the original application as- 
sembly code, and adds instrumentation code that sends information about the 
loads and stores executed by the program to the simulator. 

Fig. 1. System design 



Our custom memory architecture simulator simulates a 3-level memory hier- 
archy plus a highly configurable memory cache assist with modules for modeling 
a stream buffer and a victim cache. The sizes of the victim cache and stream 
buffer are changeable at run time via commands embedded in the simulated 
program. 



4.2 Compilation 

We have used version 2.95 of the GCC compiler collection to conduct all the 
experiments. The compiler back-end was modified to emit special code sequences 
before entering a loop, or on the code path for exiting a loop. Given that the 
compiler back-end is common to the C and Fortran 77 compilers, we were able to 
use this instrumentation for compiling all the SPEC95 benchmarks. 

The code sequences were used for adjusting the cache assist allocation, and for 
collecting statistics and identifying the loop (source file name and line number), 
and signaling to the cache simulator that a loop is being entered or exited. 

In order not to modify the behavior of the program, the code sequences leave 
the processor in the same state as it was before the sequence in question has run. 
This is achieved by saving and restoring all the registers that the code sequence 
uses, including the flag registers. Furthermore, the loop instrumentation is done 
in the assembly emitting pass of the compiler (the last compilation pass), so it 
does not affect the code generation. 

All the benchmarks were compiled using the -O2 optimization flag; the 
target instruction set was SPARC V8plus. 

4.3 Benchmarks 

The set of benchmarks shown in Table 1 was chosen because it has a good mix 
of both numeric and non-numeric programs, because they are fairly memory 
hierarchy intensive, and because SPEC95 is a standard set of benchmarks. All 
benchmark programs were simulated until completion. 



Table 1. Benchmarks used 

Benchmark   Description                            Instructions   Memory references 
go          Plays the game GO                      3.20e+10       7.76e+09 
ijpeg       Image compression                      2.70e+10       7.39e+09 
perl        Perl interpreter                       1.42e+10       3.42e+07 
apsi        Calculates statistics on temperature   3.74e+10       1.20e+10 
fpppp       Performs multi-electron derivatives    3.18e+11       1.03e+11 
swim        Solves shallow water equations         3.21e+10       1.32e+10 
turb3d      Simulates turbulence                   1.13e+11       2.86e+10 
wave        Solves Maxwell’s equations             3.80e+10       1.20e+10 


For some of the experiments profiling was used to select an “optimal” cache 
assist configuration. Profiling was performed using the SPEC training input set. 
The profile information was then used to run the benchmarks with the reference 
input set. We have verified that such profiling is accurate. 

5 Performance Evaluation 

To compare the relative performance of different cache assist configurations we 
use two main metrics: miss rate and execution time. For each experiment we 
gather the following kinds of data in order to evaluate the cache and cache assist 
performance. 

— L1 and L2 miss rates 

— number of hits in assist buffer 

— miss rate reduction 

We define the following equation to determine the overall performance 
improvement for the system: 

    miss_rate_reduction = (old_miss_rate - new_miss_rate) * 100.0 / old_miss_rate    (1) 

We simulate a base cache hierarchy with a 16KB direct mapped L1 cache 
and a 256KB 2 way set-associative L2 cache. The line size is 32 bytes for L1 and 
64 bytes for L2. We will call this the base system configuration. Figures 2 and 3 
show the L1 and L2 miss rates respectively, for the benchmarks using the base 
configuration. Only swim and wave have L1 miss rates that are greater than 
15% and, except for apsi, all of them have L2 miss rates less than 3%. 
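
Equation (1) is simple enough to evaluate directly. The helper below is a 
minimal sketch of it; the function name and the guard against a zero baseline 
are assumptions, and the input values in the example are made up for 
illustration rather than taken from the measurements. 

```python
def miss_rate_reduction(old_miss_rate, new_miss_rate):
    """Percentage reduction in miss rate relative to the old miss rate,
    following Equation (1). Both rates must use the same unit."""
    if old_miss_rate <= 0:
        raise ValueError("old_miss_rate must be positive")
    return (old_miss_rate - new_miss_rate) * 100.0 / old_miss_rate


# Hypothetical example: a miss rate falling from 20% to 15%
# removes a quarter of the misses.
example = miss_rate_reduction(20.0, 15.0)
```

A negative result means the new configuration has a higher miss rate than the 
baseline. 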




Fig. 2. L1 miss rate of 16KB direct-mapped, 32B line size cache 



5.1 The Performance of Individual Cache Assists 

The performance of the individual cache assists is evaluated using the base sys- 
tem configuration and either a 1KB victim cache or a 1KB stream buffer. Figure 4 
shows the miss rate reduction for each of the assists when compared to the base 
configuration. 

The effect varies from program to program. In go the stream buffer barely 
has an impact (under 5% miss rate reduction), but the victim cache reduces the 
miss rate by 50%. The same is observed for perl and fpppp where the victim 
cache reduces the miss rate much more than the stream buffer. The reverse is 
observed in the case of turb3d where the stream buffer reduces the miss rate by 
55%, but the victim cache only reduces it by 23%. For apsi, ijpeg and wave the 
difference is not as pronounced. 

The above results confirm the advantage of using a cache assist, but the type 
of cache assist that is most useful varies from application to application. Thus 
we conjecture that a system that has a cache assist that can be reconfigured 
between a victim cache or a stream buffer at run time on a per program basis 
would improve performance. 

Fig. 3. L2 miss rate of 256KB, 2-way set associative, 64B line size L2 cache 

The fact that memory accesses in a program very seldom follow a uniform 
pattern suggests that the effect of cache assists also varies within a program. To 
evaluate the effect of cache assists on different portions of the code we instrument 
and collect performance data for all the inner loops in a program. The inner 
loops’ memory access behavior is indicative of the entire program behavior since 
instructions executed in the inner loops often account for more than 98% of the 
memory reference instructions executed by a program. 

Figure 5 shows the miss rate reduction per loop for the apsi benchmark when 
using either a 1KB stream buffer or a 1KB victim cache for a given loop. For 
some loops the victim cache reduces the miss rate much more than the stream 
buffer, whereas the opposite is true for other loops. 

Figure 6 shows the miss rate reduction compared to a normal cache hierarchy 
for different instantiations of the loop at line 276 from file jidcting.c in the ijpeg 
benchmark when using a 1KB victim cache or stream buffer. The miss rate 
reduction varies a lot between loop instantiations, with some instances preferring 
a victim cache and others preferring a stream buffer. 

The miss rate reduction when using a cache assist varies widely between dif- 
ferent loops, and between instantiations of the same loop. We can now conclude 
that cache assist adaptivity is not only desirable at the program level, but it 
should also be applied dynamically within a program. 

5.2 Dynamic Combination of Cache Assist Techniques 

So far we discussed using the cache assist memory either as a stream buffer or as 
a victim cache. Given the fact that few programs exhibit pure temporal locality or 
spatial locality, but rather a mix of them, one can expect that using both cache 
assists at the same time would have a better performance. To take advantage 



Fig. 4. Miss reduction rate for a 1KB cache assist 

Fig. 5. Miss rate reduction per loop (a 1KB assist, the apsi benchmark) 

Fig. 6. Miss rate reduction per loop instantiation in ijpeg benchmark 



of the facts presented above a program could change the cache assist structure 
either initially or before entering a loop so that it is either a victim cache or a 
stream buffer, depending on what configuration results in a lower miss rate. The 
question is, given limited cache assist memory, what is the best way to partition 
it. 

To investigate different possibilities of adaptation we propose four approaches 
to partitioning the total (limited) cache assist space between the victim cache 
and the stream buffer. They are: 

1. Use the entire cache assist memory either as a victim cache or a stream buffer, 
changing the use for each loop (the dyna_loop approach). The decision to 
use one configuration or the other is taken based on which achieves a greater 
miss removal rate for that loop. The miss reduction information comes from 
profiling. 

In the dyna_loop case the cache assist can be used either as a stream buffer 
or as a victim cache for any loop. We conjecture that splitting the cache 
assist and using a part of it as a stream buffer, and another as a victim 
cache would further improve the performance. The following three strategies 
use this kind of partitioning. 

2. Partition the cache assist memory between the victim cache and the stream 
buffer in the same ratio as the miss reduction rate of the victim cache and the 
stream buffer for the whole program (the part_buf approach). The partition 
is fixed for the duration of the program. 

3. The dyna_buf approach partitions the cache assist memory between the 
victim cache and the stream buffer per inner loop, proportionally to the miss 
removal rate ratio of victim cache and stream buffer for that loop. 

4. The half_buf approach uses one half of the cache assist memory as a victim 
cache, and the other half as a stream buffer for the whole program. 

dynaJoop and dyna-buf are dynamically adapting the cache assist configura- 
tion whereas part_buf and half-buf are not adaptive approaches, they are studied 
for comparison. 
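
The four partitioning policies can be sketched as follows. This is an
illustrative Python sketch, not the simulator used in the experiments: the
32-byte line granularity, the rounding rule, the tie-breaking and all function
names are assumptions, and the miss reduction rates are taken to come from
profiling, as described above.

```python
LINE_BYTES = 32  # assumed cache-line size; not specified in this section


def split_proportional(vc_reduction, sb_reduction, total_bytes):
    """Split the assist in whole lines, proportionally to the two reduction rates.

    With whole-program rates this models part_buf; with the rates of one
    inner loop it models dyna_buf. Returns (victim_bytes, stream_bytes).
    """
    lines = total_bytes // LINE_BYTES
    total = vc_reduction + sb_reduction
    if total <= 0:
        vc_lines = lines // 2  # assumption: fall back to an even split
    else:
        vc_lines = round(lines * vc_reduction / total)
    return vc_lines * LINE_BYTES, (lines - vc_lines) * LINE_BYTES


def dyna_loop(vc_reduction, sb_reduction, total_bytes):
    """Give the whole assist to the structure that removed more misses in this loop."""
    if vc_reduction >= sb_reduction:  # assumption: ties go to the victim cache
        return total_bytes, 0
    return 0, total_bytes


def half_buf(total_bytes):
    """Fixed even split between victim cache and stream buffer."""
    half = total_bytes // 2
    return half, total_bytes - half


# Hypothetical profile of one loop: 30% reduction as victim cache, 10% as stream buffer.
print(dyna_loop(30.0, 10.0, 1024))
print(split_proportional(30.0, 10.0, 1024))
print(half_buf(1024))
```

The same `split_proportional` helper serves both proportional policies; only
the scope of the profile (whole program or one inner loop) differs.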

Figure 7 shows the miss reduction rates in the dyna_loop case. Profiling
information gathered in the experiments summarized in Fig. 5 is used to
configure the cache assist as a stream buffer or as a victim cache for each
loop. The performance improvement compared to the best of either a stream
buffer or a victim cache for the entire program ranges from 25% to 49%. Thus
adaptivity improves performance when performed at loop level. However, the
miss rate reduction got marginally worse for fpppp (it decreased from 63.54%
to 62.27%). Almost all memory accesses (98%) are executed inside one loop, and
for this loop the cache assist is configured in the optimal way; the loss of
performance comes from the other loops in the program.

For the programs in which the stream buffer has a very small improvement
as compared to a victim cache (go, perl), the additional miss reduction rate is
minimal because any possible gain from using a stream buffer is minimal.

The results for part_buf appear in Figure 8. With the exception of fpppp, all
the benchmarks show gains when compared to using just a victim cache or a
stream buffer. The loss for fpppp arises because its most dominant loop would
need a bigger victim cache than the part_buf approach allocates. However,
the degradation is again minimal, a 2% decrease in miss rate reduction.

Figure 9 shows the results for dyna_buf. It improves the miss ratio by
32% for turb3d, 43% for apsi, 53% for wave, and 51% for ijpeg. All the results
are better than the case of using just a stream buffer or a victim cache, except
for fpppp (see the explanation given for Fig. 7). The improvement is minor for
the benchmarks that show very little improvement from using a stream buffer.

Finally, the half_buf approach uses one half of the cache assist memory as a
victim cache and the other half as a stream buffer for the whole program. This
is not a dynamic approach, but it is used for comparison with the dyna_loop
and dyna_buf approaches. The results are shown in Fig. 10 as the relative
percentage improvement over the miss rate reduction for the half_buf approach,
using the formula:

$$\frac{miss\_rate\_reduction(dyna) - miss\_rate\_reduction(half)}{miss\_rate\_reduction(half)} \times 100.0 \qquad (2)$$
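
As a quick check of how formula (2) behaves, the following small Python
sketch evaluates it. The function name and the input values are hypothetical
and are not measurements from this study.

```python
def relative_improvement(reduction_dyna, reduction_half):
    """Formula (2): percentage improvement of an adaptive scheme over half_buf."""
    return (reduction_dyna - reduction_half) / reduction_half * 100.0


# Hypothetical miss rate reductions (in %), chosen only for illustration.
print(relative_improvement(75.0, 50.0))  # adaptive scheme better: positive value
print(relative_improvement(25.0, 50.0))  # adaptive scheme worse: negative value
```

A positive value means the adaptive scheme removes more misses than half_buf,
and a negative value means half_buf wins.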



The half_buf configuration marginally outperforms the dyna_buf configuration
for apsi and turb3d. It is significantly outperformed for fpppp, go, ijpeg and
perl, by up to 28%.

We can correlate this result with the experiments using the cache assist just
as a stream buffer or a victim cache. It shows that the dyna_buf configuration
outperforms the half_buf configuration in the cases where the victim cache
performs clearly better than the stream buffer.

The dyna_loop configuration is noticeably outperformed by half_buf in two
cases, apsi and ijpeg, by up to 14%. It outperforms half_buf by 15% to 26% in
three cases: fpppp, go and perl. Thus dyna_loop is not always a win.

Figure 11 compares the miss rate reduction for all the techniques presented.
Because the previous paragraphs already compared the fixed-size,
non-reconfigurable half_buf cache assist with the reconfigurable approaches,
we do not repeat that comparison here. For fpppp, using the cache assist as a
victim cache is marginally better than any adaptive approach, but this does
not happen for any other benchmark. The programs that show high miss reduction
rates from using a victim cache, but very low ones from using a stream buffer
(swim, go, perl), get only minimal benefits from any of the proposed adaptive
schemes. dyna_buf consistently outperforms dyna_loop, except for swim and
fpppp. It also outperforms part_buf, with the exception of ijpeg, where the
difference is negligible. Thus the most dynamic approach, dyna_buf, is the
best. Adapting the cache assist configuration is most helpful in cases when
both a victim cache and a stream buffer individually show noticeable
improvement.



Fig. 7. Miss rate reduction for dyna_loop (bars: vc, sb, dyna_loop; benchmarks:
swim, turb3d, apsi, fpppp, wave, go, ijpeg, perl)



5.3 The Effect of Cache Assist Buffer Size 

The overall size of cache assist memory is an important parameter; the
effectiveness of adaptation may depend on it. The miss reduction rate for a
256B cache assist memory is shown in Figure 12. Compared to a 1KB cache assist
memory in Fig. 10, one can see that for the small cache assist the dyna_buf
approach is a win in all but one case, while only in two cases the performance
decreases as compared to the half_buf approach. Therefore, when the adaptive
cache assist memory space is smaller, the adaptive cache assist improves
performance more than it does when the cache assist memory is larger.

Fig. 8. Miss rate reduction for part_buf (bars: vc, sb, part_buf)

Fig. 9. Miss rate reduction for dyna_buf

Fig. 10. dyna_loop and dyna_buf performance relative to half_buf

Fig. 11. Miss rate reduction for all the configurations

Fig. 12. dyna_loop and dyna_buf performance relative to half_buf for a 256B
cache assist



5.4 Compiler Support 

We have shown that adaptivity of a cache assist can help reduce miss rates of 
programs. Furthermore, we have shown that changing the configuration of the 
cache assist at the point of entry in an inner loop is an excellent way to reduce 
the miss rate. This approach is amenable to compiler support. The compiler can 
determine via static analysis or via profiling feedback the optimal configuration 
of the cache assist for a specific loop, and it can insert the corresponding instruc- 
tions at the beginning of the loop. This is the approach we advocate and we are 
pursuing static analysis in our compiler work. The profiling approach was used 
in this study. 

6 Conclusions and Future Work 

We have studied a memory configuration consisting of a standard cache hierarchy 
plus a small cache assist memory that can be used either as a stream buffer or a 
victim cache. The cache assist is reconfigurable at run time to allocate a certain 
fraction of memory to victim cache and/or to stream buffer. 

We have shown that using a cache assist reduces the miss rate of the cache
and that adapting the configuration of the cache assist reduces it even more.
Several approaches have been studied and we have concluded that an approach
that reconfigures the cache assist per inner loop at run time achieves the best
performance. Using a 1KB adaptive assist memory, up to 50% additional miss rate
reduction is achieved by the best of the proposed methods. Simple static assist
memory partitioning, on the other hand, can suffer up to 15% loss of performance.

Abstract. Minimizing data communication among processors is the key to
compiling programs for distributed memory multicomputers. In this paper,
we propose new data partition and alignment techniques that partition
and align the data arrays of a program so as to minimize communication
among processors. We use skewed alignment instead of dimension-ordered
alignment to align data arrays. With the skewed scheme, we can handle
more complex programs, with less data communication, than the
dimension-ordered scheme can. Finally, the experimental results show
that our proposed scheme has more opportunities to align data arrays
such that data communication among processors is minimized.



1 Introduction 

Over the last decade, a great number of researchers have paid attention to
maximizing parallelism and minimizing communication for a given program
executed on a parallel machine [1,3,10,11,12,13]. Chen and Sheu [3], Lim et
al. [10,11], Ramanujam and Sadayappan [12], and Shih, Sheu, and Huang [13]
presented approaches that analyze the data reference patterns of programs
structured as nested loops so that the parallelized program can run on a
parallel machine in a communication-free manner under some constraints.
Furthermore, Lim et al. [10,11] tried to maximize parallelism and minimize
communication on a scalable parallel machine by using affine transformations
when a program cannot be partitioned in a communication-free manner. For a
program running on a distributed memory multicomputer, it is not easy to
distribute and manage partitioned data and computations over processors when
the affine transformation methods of [10,11] are used. In general, their
methods are suitable for shared memory or distributed shared memory
multiprocessor systems.
This is because affine transformation methods do not easily handle regular
data distributions, such as block or block-cyclic distribution, or take
workload balancing into account.

The distribution of data across processors is of critical importance to the
efficiency of a parallel program on a distributed memory machine. A good data
distribution pattern has to allow the workload to be evenly distributed over
processors so that we can maximize parallelism and minimize communication.
Recently, a number of researchers have developed parallelizing compilers that
take a sequential program and, based on automatic partitioning of data arrays
and computations, generate the target parallelized program for a parallel
machine [6,8,14,15]. The PARADIGM project [6,14] developed a fully automated
technique for translating serial programs for efficient execution on
distributed memory multicomputers. Lee [8] proposed a dynamic programming
technique to efficiently solve the data redistribution problem among program
segments based on the approaches proposed in [6,9]. Ayguade, Garcia, and
Kremer [2], and Tandri and Abdelrahman [15] also addressed data redistribution
techniques for parallel programs that explicitly specify do/doall loop
constructs.

Numerous researchers have concentrated on the alignment and distribution
problem [2,6,8,9,15]. Li and Chen [9] studied the problem of aligning data
arrays in order to minimize communication. They also showed that the
alignment problem is NP-complete in the number of data arrays within a loop
nest. Gupta and Banerjee [6] used a constraint-based method to extend the
alignment method of [9] to distributed memory multicomputers. Lee [8] solved
the data redistribution problem among loop nests using a dynamic programming
technique based on Gupta and Banerjee's method [6]. Previous investigations
[2,15] addressed approaches to solving the alignment problem by considering
how to distribute and align data arrays as well as how to preserve
parallelism in a given program.

As noted above, much of the research has focused on aligning multiple arrays
by matching the dimensions of one array with the dimensions of the other
arrays. In contrast to dimension-ordered data layouts, Kandemir et al. [7]
presented a linear algebra framework to automatically determine the optimal
data layouts, expressed by hyperplanes, for each array referenced in a
program. That is, determining skewed data layouts is important for analyzing
the data access behavior of a program and exploiting parallelism. Skewed data
layouts are useful for banded-matrix operations such as those in the BLAS
library [5]. In addition, most data arrays are referenced in a skewed manner
after loop skewing or unimodular transformations [16] are applied to the
program to extract maximum parallelism.

The main aim of this paper is to propose a framework for determining the
skewed data partition and distribution as well as the skewed data alignment
for each data array accessed in a given source program. This makes the
partitioned data and program efficient when they are distributed and executed
on a multicomputer system. Here the source program is assumed to be a loop
nest with explicit doall and do constructs [16]. First, we show how to identify
a perfect skewed data alignment relation between data arrays and the loop
nest. Based on these relations, a skewed alignment scheme is proposed to align
data arrays with a loop nest program so as to minimize communication. The
experimental results show that our skewed alignment scheme is more efficient
than the dimension-ordered matching scheme.

The rest of this paper is organized as follows. Section 2 states the machine 
model and program model used here as well as the strategies of the skewed 
data partition and data distribution. In Section 3, we explore the alignment 
relations between a data array and a loop nest. Furthermore, we propose a skewed 
array alignment scheme to optimize the data alignment problem in Section 4. 
Experimental results are presented in Section 5. Conclusions are summarized in 
Section 6. 

2 Preliminaries 

2.1 Machine Model and Program Model 

Here we give the abstract target machine, a $q$-dimensional grid of
$N_1 \times N_2 \times \cdots \times N_q$ ($= N$) processors, where $q$ is the
maximum dimensionality of any array used in the source program, and is less
than or equal to the depth of the deepest loop nest appearing in the source
program. A processor in the $q$-D (D stands for dimensional) grid is
represented by the tuple $(p_1, p_2, \ldots, p_q)$, where
$0 \le p_i \le N_i - 1$ for $1 \le i \le q$. Such a topology can be easily
embedded into most distributed memory machines, such as hypercubes and tori.

The program model used here is assumed to be an $l$-deep non-perfect loop
nest containing explicit doall and do loop constructs. Programs belonging
to this loop model can be obtained via a sequence of loop transformations
described in [16]. The parallel program generated from a sequential program
corresponds to the SPMD (single-program multiple-data) model, in which each
processor executes the same program code but operates on distinct data items
[6,8,16]. In addition, the owner-computes rule is used here: the processor
that owns the left-hand side variable of a statement computes the whole
statement [16].

2.2 Skewed Data Partition and Distribution 

In this subsection, we introduce how to express the skewed data partition and
data distribution over processors. We first state the difference between the
traditional data space and the skewed data space representations of a data
array. Consider a 2-dimensional array $A[i, j]$ with $0 \le i \le 3$ and
$0 \le j \le 3$. From the point of view of the traditional data space
representation, the data space of array $A$ has two axes $i$ and $j$. In other
words, the data space is spanned by the two base vectors $(1,0)$ and $(0,1)$
corresponding to axes $i$ and $j$, respectively. Therefore, an element
$A[i, j]$ of array $A$ can be represented by the matrix representation

$$I = DI, \quad \text{where } I = \begin{bmatrix} i \\ j \end{bmatrix} \text{ and } D = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},$$

which uses the two base vectors as the basis of the data space. Now we intend
to derive a new data space representation, named the skewed data space
representation, which is used in designing our proposed skewed alignment
scheme. To illustrate the idea of the skewed data alignment technique, we use
two new vectors $s_1 = (1,0)$ and $s_2 = (1,1)$, which are linearly
independent, as the basis of a new data space. The transformation from the
original data space to the new data space can be derived as follows. We start
by evaluating all components of the vectors $s_1$ and $s_2$. Let $g_1$ and
$g_2$ be the greatest common divisors (gcd) of all components in $s_1$ and
$s_2$, respectively. Then, we divide each component of $s_1$ by $g_1$ to
obtain the new vector $s'_1$, and each component of $s_2$ by $g_2$ to obtain
the new vector $s'_2$. In this example, we have $s'_1 = s_1$ and
$s'_2 = s_2$. Now we put the two new vectors $(1,0)$ and $(1,1)$ as columns to
form a transformation matrix

$$D' = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$

so that we have two corresponding new axes $i'$ and $j'$. Detailed derivations
are shown below.

$$DI = D'I', \quad \text{where } I' = \begin{bmatrix} i' \\ j' \end{bmatrix}, \quad \text{i.e.,} \quad \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} i' \\ j' \end{bmatrix}$$

$$\Rightarrow D'^{-1} D I = D'^{-1} D' I' = I_{2 \times 2} I' = I'$$

$$\therefore I' = D'^{-1} D I,$$

where

$$D'^{-1} = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} \text{ (the inverse matrix of } D') \quad \text{and} \quad I_{2 \times 2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \text{ (an identity matrix).}$$

That is, we have the following index transformation

$$\begin{bmatrix} i' \\ j' \end{bmatrix} = \begin{bmatrix} i - j \\ j \end{bmatrix},$$

where $-3 \le i' \le 3$ and $\max(0, -i') \le j' \le \min(3, 3 - i')$. Hence,
$i'$ is referred to as the axis of the vector $s_1 = (1,0)$ and $j'$ is
referred to as the axis of the vector $s_2 = (1,1)$. For the general case of
the above transformation derivations, we can refer to reference [3] for
details.
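
The index transformation $i' = i - j$, $j' = j$ for the $4 \times 4$ example
can be checked mechanically. The short Python sketch below (the function
names are ours, not from the paper) enumerates all elements and confirms the
stated bounds on $i'$ and $j'$ as well as the inverse mapping $I = D'I'$.

```python
def to_skewed(i, j):
    """Original indices (i, j) to skewed indices (i', j') = (i - j, j)."""
    return i - j, j


def from_skewed(ip, jp):
    """Inverse mapping I = D' I' with D' = [[1, 1], [0, 1]]."""
    return ip + jp, jp


# Enumerate the 4 x 4 array A[0..3, 0..3] and check the bounds given in the text.
for i in range(4):
    for j in range(4):
        ip, jp = to_skewed(i, j)
        assert -3 <= ip <= 3
        assert max(0, -ip) <= jp <= min(3, 3 - ip)
        assert from_skewed(ip, jp) == (i, j)

# Elements with the same i' lie on one diagonal of A, i.e. along s2 = (1, 1).
print(to_skewed(2, 1), to_skewed(3, 2))
```

Both printed pairs share $i' = 1$, which illustrates that moving along the
$j'$ axis advances $i$ and $j$ together.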

With these transformed axes of data arrays, we discuss below how to perform
skewed data partition and data distribution among processors on a $q$-D grid.
For traditional data distribution, the $k$-th dimension of an $n$-dimensional
data array $A$ is denoted as $A_k$, $1 \le k \le n$. Here we use vectors to
denote the dimensions of a data array. Let the original data space be spanned
by the base vectors $e_i$ for $1 \le i \le n$, where $e_i$ is a $1 \times n$
vector whose components are all 0 except for the $i$-th, which is 1. Each
element of array $A$ can be represented by these $n$ base vectors. Thus, each
array dimension $e_k$ is mapped to a unique dimension $map(e_k)$ of the
processor grid, where $1 \le map(e_k) \le q$. For skewed data partition and
distribution, a data array $A$ in general can be represented by $n$ row
vectors $\bar{m}_k$, $1 \le k \le n$, where these $n$ vectors are linearly
independent. Thus, each array dimension $\bar{m}_k$ is mapped to a unique
dimension $map(\bar{m}_k)$ of the processor grid, where
$1 \le map(\bar{m}_k) \le q$. Here we suppose that each array dimension
$\bar{m}_k$ has the corresponding transformed index axis $i'_k$. The skewed
data distribution for dimension $\bar{m}_k$ of a data array $A$ is of the form

$$f_A^{\bar{m}_k}(i'_k) = \begin{cases} \left\lfloor \dfrac{d \cdot i'_k - offset}{block} \right\rfloor [\bmod\ N_{map(\bar{m}_k)}] & \text{if } A \text{ is distributed} \\ X & \text{if } A \text{ is replicated} \\ \text{constant or } * & \text{if } A \text{ is not distributed,} \end{cases}$$

where $d \in \{-1, 1\}$ and the part surrounded by square brackets is optional
in this expression. Symbol $d$ indicates whether the index increases
($d = 1$) or decreases ($d = -1$) along the direction $\bar{m}_k$. Symbol
$offset$ indicates the displacement for the mapping of dimension $\bar{m}_k$.
Symbol $block$ indicates the block size for distribution in this dimension.
Function $f_A^{\bar{m}_k}(i'_k)$ returns the processor index along the
dimension $map(\bar{m}_k)$ of the processor grid. This distribution function,
extended with skewed data distribution, generalizes the method proposed in
[8].

Fig. 1. Some data distribution schema for a 16 x 16 data array A

Here we show different possible data distributions for a $16 \times 16$ data
array $A[0..15, 0..15]$ on a $4 \times 4$ processor grid, as in Fig. 1(a) and
Fig. 1(b), and on a four-processor multicomputer, as in Fig. 1(c). The block
size in these examples is 4. Their corresponding data distribution functions
are shown below.

(a) = L^J mod 4 (b) mod 4 

= L^J mod 4 fA'^Hf) = ^ 

(c) = L(J mod4 
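
For illustration, a distribution function of the kind described above can be
written as a small Python helper. This is a sketch under stated assumptions:
it assumes the distributed case has the form
$\lfloor (d \cdot i' - offset) / block \rfloor$, optionally taken modulo the
number of processors along the mapped grid dimension, and the name
`processor_index` and its parameters are hypothetical.

```python
def processor_index(index, block, d=1, offset=0, grid_size=None):
    """Processor coordinate for one (possibly skewed) array index.

    Assumed form: floor((d * index - offset) / block), taken modulo grid_size
    when grid_size is given (the optional bracketed part of the formula).
    """
    if d not in (-1, 1):
        raise ValueError("d must be 1 or -1")
    p = (d * index - offset) // block  # Python's // is a true floor division
    return p if grid_size is None else p % grid_size


# Block size 4 on 4 processors along one grid dimension, as in the examples above.
print([processor_index(i, 4, grid_size=4) for i in range(16)])

# A skewed index i' = i - j of a 16 x 16 array ranges over -15..15;
# the modulo wraps negative block numbers onto valid processor coordinates.
print(processor_index(-3, 4, grid_size=4))
```

Without `grid_size` the function describes a plain block distribution; with
it, blocks are assigned to processors cyclically.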



3 Data Alignment Relations 

The iteration space of an $l$-deep loop nest is an $l$-dimensional polyhedron
where each point (iteration) is denoted by an $l \times 1$ column vector
$I = (i_1, i_2, \ldots, i_l)^t$, where $t$ denotes the matrix transpose
operation and each $i_k$ denotes a loop index, $1 \le k \le l$. Here $i_1$ is
the outermost loop and $i_l$ is the innermost loop. Herein, $\bar{0}^t$
represents an $l \times 1$ column vector all of whose components are zero.



110 Tzung-Shi Chen and Chih-Yung Chang 



In the following, we define a data reference for an $n$-dimensional array $A$
accessed by a statement surrounded by an $l$-deep non-perfect loop nest. The
data reference to array $A$ is expressed by
$A[exp_1, exp_2, \ldots, exp_n]$, where $exp_j$, $1 \le j \le n$, is an
integer-valued linear expression possibly involving the loop index variables
$i_1, i_2, \ldots, i_l$. This data reference can also be expressed by
$M_A I + \bar{o}$, where $M_A$ is defined as an $n \times l$ access matrix, $I$
is the iteration vector, and $\bar{o}$ is an offset (constant vector), an
$n \times 1$ column vector [3]. For example, for the reference
$A[i - 1, i + j]$ surrounded by a 2-deep loop nest with index variables $i$
and $j$, we have the access representation

$$M_A \begin{bmatrix} i \\ j \end{bmatrix} + \bar{o}, \quad \text{where } M_A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \text{ and } \bar{o} = \begin{bmatrix} -1 \\ 0 \end{bmatrix}.$$
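
The access representation can be evaluated directly. The following Python
sketch (the helper name `access` is ours) computes $M_A I + \bar{o}$ for the
reference $A[i - 1, i + j]$ and compares it with the subscripts written out
by hand.

```python
def access(matrix, offset, iteration):
    """Array subscripts M * I + o for one iteration vector I."""
    return tuple(
        sum(m * x for m, x in zip(row, iteration)) + c
        for row, c in zip(matrix, offset)
    )


M_A = [[1, 0], [1, 1]]   # rows correspond to the subscripts i - 1 and i + j
o_A = [-1, 0]

# For every iteration of a small 2-deep nest, the matrix form agrees with
# the subscripts of A[i - 1, i + j].
for i in range(4):
    for j in range(4):
        assert access(M_A, o_A, (i, j)) == (i - 1, i + j)

print(access(M_A, o_A, (2, 3)))
```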



Next, we will explore alignment relations between a certain data array
surrounded by an $l$-deep loop nest and the $k$-th outermost loop,
$1 \le k \le l$.



Definition 1: [Perfect Data Alignment]

Suppose that all the iterations of the $k$-th outermost loop in an $l$-deep
loop nest access the elements of array $A$ along a certain direction
(dimension) $\bar{m}$. We then say that dimension $\bar{m}$ of array $A$ is
in perfect data alignment with the $k$-th outermost loop.

□

By Definition 1, if the property of perfect data alignment holds, no
communication is incurred when all iterations of the $k$-th outermost loop
are distributed over processors together with the elements of array $A$
accessed by these iterations along dimension $\bar{m}$. This is because the
referenced data and the corresponding computations (iterations) are
distributed to the same processor. Hence, we have the following theorem,
based on the concept of skewed data layouts presented in [7].



Theorem 1:

Suppose we have an $n$-dimensional array reference $R$ surrounded by an
$l$-deep loop nest, with an $n \times l$ access matrix $M_R$ and an offset
vector $\bar{o}$ for the array reference $R$. If the $k$-th column vector
$\bar{m}_k^t \ne \bar{0}^t$ in $M_R$, $1 \le k \le l$, then aligning the array
direction $\bar{m}_k$ with the $k$-th outermost loop is a perfect data
alignment.

Proof:

The detailed proof can be found in [4].

□

For example, consider the following 2-deep loop nest L1.

   do i = 0 to m - 1      /* doall loop */
      do j = 0 to m - 1
         B[j + 1, i] = B[j, i] * A[i + j, j]      (L1)
      enddo
   enddo

We have the access matrix for the data reference $A[i + j, j]$

$$M_A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.$$

According to Theorem 1, array $A$ along $\bar{m}_1 = (1,0)$ is perfectly
aligned with loop $i$. For instance, the elements of array $A$ accessed by the
two consecutive iterations $I_1 = (1,1)$ and $I_2 = (2,1)$ of loop $i$ are
related by the data reference direction $\bar{m}_1^t = (1,0)^t$; that is,

$$M_A \begin{bmatrix} 2 \\ 1 \end{bmatrix} - M_A \begin{bmatrix} 1 \\ 1 \end{bmatrix} = M_A \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.$$

Clearly, we obtain a good reference pattern by distributing array $A$ along
the direction $\bar{m}_1 = (1,0)$ while distributing the iterations of loop
$i$ over processors. As a result, communication is reduced when array $A$ is
distributed along $(1,0)$ and thus aligned with loop $i$. Likewise, we also
minimize data communication by distributing array $A$ along the direction
$\bar{m}_2 = (1,1)$ while the iterations of loop $j$ are distributed over
processors. In contrast to perfect data alignment, a non-perfect data
alignment aligns array $A$ along direction $(0,1)$ with either loop $i$ or
loop $j$. Applying a non-perfect data alignment when distributing data and
iterations generally incurs a lot of data communication among processors.



4 Skewed Array Alignment 

We define a directed graph, called the computation-communication alignment
graph (CCAG), to extract and express the characteristics of a program and the
data arrays accessed by the program.

Definition 2: [Computation-Communication Alignment Graph]

A computation-communication alignment graph, CCAG, is a directed graph
$G = (V, E)$ for a program with an $l$-deep loop nest, where $V$ is a set of
vertices and $E$ is a set of edges, defined below. $V$ is composed of three
different types of vertices:

(1) an array dimension vertex $A_{\bar{m}_k}$ with respect to an array
dimension $\bar{m}_k$ whose transposed column vector
$\bar{m}_k^t \ne \bar{0}^t$ is the $k$-th column vector of the $n \times l$
access matrix $M_R$ for a reference $R$ to an $n$-dimensional array $A$ in a
statement inside an $l$-deep loop nest, $1 \le k \le l$,

(2) a loop vertex with respect to a loop with the do (sequential loop)
construct, and

(3) a loop vertex with respect to a loop with the doall (parallel loop)
construct.

$E$ is composed of three different types of edges:

(1) a read edge, $(A_{\bar{m}_k}, \text{loop } k)$, links array dimension
$\bar{m}_k$ of array $A$ to the corresponding $k$-th outermost loop if the
reference $R$ is on the right-hand side of the statement,

(2) a write edge, $(\text{loop } k, A_{\bar{m}_k})$, links the corresponding
$k$-th outermost loop to array dimension $\bar{m}_k$ of array $A$ if the
reference $R$ is on the left-hand side of the statement, and

(3) a do edge links two consecutive loops, from outer to inner, if such a
loop nesting exists in the program.

The weights on read and write edges indicate the number of edges
$(A_{\bar{m}_k}, \text{loop } k)$ and $(\text{loop } k, A_{\bar{m}_k})$,
respectively.

□

Fig. 2. CCAG for loop nest L1

Note that if a program contains no loop nest, or more than one consecutive
loop nest, we assume that there exists an outermost loop surrounding the
original program. According to Definition 2, if such a loop structure exists,
its root, at loop level 1, is the outermost loop vertex. The constructed loop
structure can be labeled from the root down to the leaves, one level at a
time, with loop levels 1 to $l$.

For loop nest L1, we have a CCAG with 4 array dimension vertices,
$A_{(1,0)}$, $A_{(1,1)}$, $B_{(0,1)}$, and $B_{(1,0)}$, and 2 loop vertices,
where loop vertex $i$, indicated by the gray point, is a doall loop and loop
vertex $j$, indicated by the empty point, is a do loop, as shown in Fig. 2. A
solid arrow with weight one links a perfect data alignment dimension to its
corresponding loop, or vice versa. A do edge, drawn as a dashed arrow,
represents the loop structure and links the outer loop $i$ to the inner loop
$j$.
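
The construction of this graph can be sketched in Python. The sketch below is
a simplified illustration, not the authors' implementation: it assumes a
single chain of nested loops (the paper also allows non-perfect nests), takes
the access matrices as given, and uses the hypothetical name `build_ccag`.
Array dimension vertices are the nonzero columns of each access matrix, as in
the definition of the CCAG.

```python
from collections import Counter


def build_ccag(loops, references):
    """Build a CCAG for one chain of nested loops.

    loops: [(name, kind)] from outermost to innermost, kind "doall" or "do".
    references: [(array, access_matrix, is_write)], one entry per reference.
    Returns (array dimension vertices, weighted read/write edges, do edges).
    """
    dims = set()
    weighted = Counter()
    for array, matrix, is_write in references:
        for k, (loop, _kind) in enumerate(loops):
            column = tuple(row[k] for row in matrix)
            if not any(column):        # zero column: no alignment with loop k
                continue
            dim = (array, column)
            dims.add(dim)
            # write edge: loop -> dimension; read edge: dimension -> loop
            edge = (loop, dim) if is_write else (dim, loop)
            weighted[edge] += 1
    do_edges = [(outer[0], inner[0]) for outer, inner in zip(loops, loops[1:])]
    return dims, weighted, do_edges


# Loop nest L1: B[j + 1, i] = B[j, i] * A[i + j, j]
loops = [("i", "doall"), ("j", "do")]
refs = [
    ("B", [[0, 1], [1, 0]], True),    # B[j + 1, i], left-hand side
    ("B", [[0, 1], [1, 0]], False),   # B[j, i]
    ("A", [[1, 1], [0, 1]], False),   # A[i + j, j]
]
dims, weighted, do_edges = build_ccag(loops, refs)
print(sorted(dims))
print(do_edges)
```

For L1 this yields the four array dimension vertices named in the text, with
every read and write edge carrying weight one.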

Clearly, a lot of information can be extracted from the construction of a
CCAG. This includes which loops (iterations) can be parallelized, which array
dimension needs to be aligned with some loop, and which array dimension needs
to be aligned or matched with the other array dimensions. Each partitioned
array dimension can be identified while all the dimensions of the array are
determined as well.

The data alignment problem is known to be NP-complete [2,9]. In this paper,
we use a new kind of graph to capture the characteristics of a program and
its data arrays, with the intent of reducing the data communication overhead.
Thus, we present a heuristic alignment scheme in the following. To the best
of our knowledge, this is the first discussion to explore the skewed data
alignment problem.

In the following subsections, we first address how to select the alignment
directions from a given data array to a loop nest. Next, we present how to
align multiple data arrays with the loop nest to minimize the data
communication overhead.

4.1 Alignment for a Data Array

A heuristic algorithm that efficiently aligns a data array with a loop nest,
minimizing communication among processors based on a given CCAG, is described
below. We first discuss how to select data alignment dimensions from a given
data array. There are two main phases for aligning multiple data references
to an array: the doall-loops phase followed by the do-loops phase. Based on a
constructed CCAG, in the doall-loops phase we first examine the loop structure
with existing doall loops from loop level 1 to the deepest loop level $l$ in
sequence. That is, when distributing computations over processors, we give
the highest priority to the outermost parallel loop, proceeding from outer to
inner. This leads to the largest granularity of the program distributed to
processors; in other words, we expect to exploit the coarse-grained
parallelism of the program, which generally benefits distributed memory
machines. In this first phase, we are concerned with minimizing communication
while distributing computations along doall loops and distributing data
arrays along the selected skewed dimensions. This is because the computations
must be distributed along doall loops so that they can be executed in
parallel. After that, the do-loops phase processes the do loops in the same
way as the doall-loops phase: we examine the loop structure with the existing
do loops from loop level 1 to the deepest loop level $l$ as well. In this
second phase, we are concerned with minimizing the residual communication
while data arrays are partitioned and distributed along the selected skewed
dimensions.

Each phase has two sub-phases: the former aligns write data references with
the loop nest, whereas the latter aligns read data references with the loop
nest. That is, we give higher priority to aligning write references with the
loop nest than to aligning read ones. As a result, the computation
distribution for the loop can meet the owner-computes rule in the generated
parallel code.

The algorithm for aligning an $n$-dimensional data array $A$ with an $l$-deep
loop nest is described below. Initially, let the set of alignment vectors for
array $A$, $S_A$, be empty, i.e., $S_A = \phi$.

Algorithm Skewed-Alignment($A$, $S_A$)
Begin
   Doall-loops phase:
      Alignment(Type of doall loop, Type of write edge, $A$, $S_A$);
      Alignment(Type of doall loop, Type of read edge, $A$, $S_A$);
   Do-loops phase:
      Alignment(Type of do loop, Type of write edge, $A$, $S_A$);
      Alignment(Type of do loop, Type of read edge, $A$, $S_A$);
End

Procedure Alignment(loop-type, edge-type, $A$, $S_A$)

Step 1: Set $p$ to 1.

Step 2: Examine loop level $p$, $1 \le p \le l$, on the loop structure in the
CCAG of the $l$-deep loop nest.

Step 3: Assume there are $r$ reference dimensions to array $A$ connected to
loop vertices of loop-type in loop level $p$. For each reference dimension
$\bar{m}_k$, $1 \le k \le r$, we sum up the weights on the edges connecting
$\bar{m}_k$ with all the loops of loop-type in loop level $p$. We sort the
dimensions by weight and obtain the aligned array dimensions $\bar{m}_k$ with
weights $w_k$, $1 \le k \le r$, in decreasing order; that is,
$w_1 \ge w_2 \ge \cdots \ge w_r$. When $w_a = w_{a+1} = \cdots = w_b$, the
tied dimensions are ordered so that $w'_a \ge w'_{a+1} \ge \cdots \ge w'_b$,
where $w'_k$ is the total weight on the edges connecting $\bar{m}_k$ with all
the loops of loop-type in loop levels $p + 1$ to $l$,
$1 \le a \le k \le b \le r$.

Step 4: We add the alignment directions one by one, from $\bar{m}_1$ to
$\bar{m}_r$, into the set $S_A$ if the vector is not a linear combination of
the vectors already in $S_A$.

Step 5: We examine the set $S_A$. If its cardinality is not equal to $n$, the
dimensionality of array $A$, and $p < l$, add one to $p$ and go to Step 2;
otherwise, go to Step 6.

Step 6: Complete the construction of the skewed data alignment dimensions
associated with the loops of loop-type in the loop nest.



The construction of the aligned dimension set $S_A$ for array A may be incomplete
after the above algorithm is applied; that is, the cardinality of the constructed
$S_A$ may be less than n, the dimensionality of array A. In that case, we add
suitable base vectors to $S_A$ until it contains n linearly independent vectors,
which form a complete set of dimensions for the skewed data representation. This is
necessary because n linearly independent vectors are required to represent the data
space of array A. Under these heuristic selection criteria, a larger sum of weights
on an aligned dimension of an array means that more references access that
dimension. Therefore, under this selection scheme we obtain the set of aligned
dimensions of an array that best minimizes communication.
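
The selection procedure above can be sketched in Python as follows. This is a minimal
illustration rather than the authors' implementation: the function names, the
(vector, weight, tie-break weight) tuple layout, and the sample weights are all
assumptions made for the example. Candidates are grouped per loop level, ranked by
weight, and added only when linearly independent of the vectors already chosen; unit
base vectors then complete the set.

```python
from fractions import Fraction


def rank(vectors):
    """Rank of a list of integer vectors, by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    found = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((r for r in range(found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(len(rows)):
            if r != found and rows[r][col] != 0:
                factor = rows[r][col] / rows[found][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found


def alignment(levels, n, selected):
    """One Alignment call: levels[p] holds (vector, weight, deeper_weight) candidates."""
    for candidates in levels:                       # loop level p = 1 .. l
        ordered = sorted(candidates, key=lambda c: (-c[1], -c[2]))
        for vector, _weight, _deeper in ordered:
            if len(selected) == n:                  # S_A is complete
                return selected
            if rank(selected + [vector]) > len(selected):
                selected.append(vector)             # independent of earlier choices
    return selected


def complete_with_base_vectors(selected, n):
    """Pad an incomplete set with unit vectors until it has n independent vectors."""
    for axis in range(n):
        if len(selected) == n:
            break
        unit = tuple(1 if d == axis else 0 for d in range(n))
        if rank(selected + [unit]) > len(selected):
            selected.append(unit)
    return selected


# Hypothetical weights for a 2-D array; (2, 0) is skipped as a multiple of (1, 0).
levels = [[((2, 0), 2, 0), ((1, 0), 3, 0)], [((1, 1), 1, 0)]]
print(complete_with_base_vectors(alignment(levels, 2, []), 2))
```

Calling `alignment` four times on the same list (doall/write, doall/read, do/write,
do/read) before `complete_with_base_vectors` would mirror the two phases of the
algorithm.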



4.2 Aligning Multiple Data Arrays for a Program 

Assume we have K data arrays, $A_1, A_2, \ldots, A_K$, accessed by the source
program. For each data array $A_i$, $1 \le i \le K$, we obtain the corresponding set
$S_{A_i}$ of selected skewed alignment vectors by applying our proposed
skewed-alignment algorithm. First, we transform and reduce the CCAG into a reduced
CCAG, an undirected graph called the RCCAG. Each array dimension vertex in the RCCAG
is a vertex obtained from the set $S_{A_i}$ of an array $A_i$. The set of edges in
the RCCAG is composed of the edges connecting loop vertices to the array dimension
vertices in $S_{A_i}$ for each array $A_i$. Next, we transform the RCCAG into a CAG
(component affinity graph) as defined in [9]. Finally, we apply the alignment
algorithm of [9], maximum-weight bipartite graph matching, to the transformed CAG to
solve our alignment problem.

Fig. 3. Two cases of transforming RCCAG to CAG: (a) Case 1; (b) Case 2



In what follows, we describe how to translate the RCCAG into a CAG. The array
dimension vertices in the RCCAG serve as the vertices of the transformed CAG. The
undirected edges in the transformed CAG are generated by one of the following two
cases.

Case 1: Suppose $w_1$ is the sum of the weights on all edges connecting array
dimension vertex $A_{\vec{m}_1}$ of array A with a loop vertex i, whether a do or a
doall loop. Also suppose $w_2$ is the sum of the weights on all edges connecting
array dimension vertex $B_{\vec{m}_2}$ of the other array B with the same loop
vertex i. Then there is an undirected edge connecting $A_{\vec{m}_1}$ with
$B_{\vec{m}_2}$ with weight $w_1 + w_2$, as shown in Fig. 3 (a).

Case 2: Suppose $w_{1,1}, \ldots, w_{1,k}$ are the sums of the weights on all edges
connecting array dimension vertex $A_{\vec{m}_1}$ of array A with the corresponding
loop vertices $i_1, \ldots, i_k$, whether do or doall loops. Also suppose
$w_{2,1}, \ldots, w_{2,k}$ are the sums of the weights on all edges connecting array
dimension vertex $B_{\vec{m}_2}$ of the other array B with the same loop vertices
$i_1, \ldots, i_k$. Then there is an undirected edge connecting $A_{\vec{m}_1}$ with
$B_{\vec{m}_2}$ with weight $\sum_{u=1}^{k} (w_{1,u} + w_{2,u})$, as shown in
Fig. 3 (b).

The weight on each edge in the transformed CAG represents the unaligned penalty:
the larger the weight of an edge, the more strongly the two array dimensions it
connects need to be aligned. We can therefore use maximum-weight bipartite graph
matching [9] to solve our alignment problem. For a given program, this scheme
transforms the data arrays onto the new axes and explores the skewed alignment
relations among these data arrays.
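
The two cases and the matching step can be sketched as follows. This is an
illustrative sketch under stated assumptions: the dictionary layout, the function
names, and the edge weights are hypothetical, and the exhaustive search merely stands
in for the maximum-weight bipartite matching algorithm of [9]. Case 1 is the special
case of Case 2 in which the two dimensions share a single loop vertex.

```python
from itertools import permutations


def cag_edges(rccag_a, rccag_b):
    """Build CAG edge weights between the aligned dimensions of two arrays.

    Each argument maps an aligned dimension to {loop vertex: summed edge weight}.
    An edge exists when two dimensions share loop vertices; its weight is the sum
    of both arrays' weights over those shared loops (Cases 1 and 2).
    """
    edges = {}
    for dim_a, loops_a in rccag_a.items():
        for dim_b, loops_b in rccag_b.items():
            shared = set(loops_a) & set(loops_b)
            if shared:
                edges[(dim_a, dim_b)] = sum(loops_a[v] + loops_b[v] for v in shared)
    return edges


def best_alignment(dims_a, dims_b, edges):
    """Exhaustive maximum-weight matching; assumes len(dims_a) <= len(dims_b)."""
    best_pairs, best_weight = None, -1
    for perm in permutations(dims_b, len(dims_a)):
        pairs = list(zip(dims_a, perm))
        weight = sum(edges.get(pair, 0) for pair in pairs)
        if weight > best_weight:
            best_pairs, best_weight = pairs, weight
    return best_pairs, best_weight


# Hypothetical weights: the real ones come from the CCAG of the loop nest.
rccag_a = {(1, 0): {"i": 2}, (1, 1): {"j": 2}}
rccag_b = {(0, 1): {"i": 2}, (1, 0): {"j": 2}}
edges = cag_edges(rccag_a, rccag_b)
print(best_alignment(list(rccag_a), list(rccag_b), edges))
```

Exhaustive search is adequate here because an array has only a handful of
dimensions; a production compiler would use a polynomial-time matching algorithm.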

Now we demonstrate how to align arrays with the loop nest L1. The relations among
the program segment and arrays A and B, i.e., the CCAG for L1, are shown in Fig. 2.
By selecting the perfect data alignments for each array, we obtain the sets
$S_A = \{\vec{m}_1^A = (1,0), \vec{m}_2^A = (1,1)\}$ and
$S_B = \{\vec{m}_1^B = (0,1), \vec{m}_2^B = (1,0)\}$. The RCCAG is then constructed
from the CCAG of loop nest L1 by transforming all directed edges into undirected
edges. From it we obtain the transformed CAG and two matched dimension partitions
$P_1$ and $P_2$, as shown in Fig. 4. $A_{(1,0)}$ is aligned with $B_{(0,1)}$ and
$A_{(1,1)}$ is aligned with $B_{(1,0)}$ when the approach of [9] is applied to this
transformed CAG.

Fig. 4. The transformed CAG for the RCCAG and the alignment between arrays A
and B



5 Experimental Results 



Based on the skewed data alignment technique, we now discuss how to determine the
data distribution and block size for each data array of the loop nest L1. The
parallelized versions of this example were written by hand. We assume a 1-D grid
machine with N processors in our experiments. By applying our proposed scheme,
$A_{(1,0)}$ is aligned with $B_{(0,1)}$ while $A_{(1,1)}$ is aligned with
$B_{(1,0)}$ for L1. In L1, loop i is a doall loop, which can be executed in
parallel. Clearly, we distribute loop i over the N processors in blocks, while loop j
is executed sequentially and needs no distribution. As derived in Section 2, we have
the following index transformations.



$I_A = \begin{pmatrix} i - j \\ j \end{pmatrix}$ for array $A[i,j]$,
$0 \le i \le 2m - 1$ and $0 \le j \le m - 1$, and

$I_B$ for array $B[i,j]$, $0 \le i \le m$ and $0 \le j \le m - 1$.

Thus, we have the following skewed data distribution for array A, as shown in
Fig. 5 (a), and for array B, as shown in Fig. 5 (b), with a block size of $m/N$,
under the assumption that m is divisible by N.

(a) $I_A$ mod N    (b) $I_B$ mod N

With such data and computation distributions over processors, the loop nest L1
can be executed in parallel in a communication-free manner.
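
As an illustration of how skewed coordinates follow from a set of alignment vectors,
the sketch below expresses an element index in the basis formed by those vectors.
It is a hedged example, not the paper's derivation: `skewed_coords` and `owner` are
hypothetical helper names, and the block-cyclic owner rule (block number modulo the
processor count) is an assumption about how a transformed coordinate might be
mapped to a processor.

```python
from fractions import Fraction


def skewed_coords(index, basis):
    """Coefficients (c1, c2) with index = c1 * basis[0] + c2 * basis[1] (2-D only)."""
    (a, c), (b, d) = basis           # the basis vectors are the columns of [[a, b], [c, d]]
    det = a * d - b * c
    if det == 0:
        raise ValueError("alignment vectors must be linearly independent")
    i, j = index
    # Cramer's rule for the 2x2 system.
    return Fraction(i * d - b * j, det), Fraction(a * j - c * i, det)


def owner(coord, block, n_procs):
    """Assumed owner rule: block number of the coordinate, taken modulo n_procs."""
    return (coord // block) % n_procs


S_A = [(1, 0), (1, 1)]   # alignment vectors of array A in the example
S_B = [(0, 1), (1, 0)]   # alignment vectors of array B in the example

print(skewed_coords((5, 2), S_A))   # element A[5, 2] in skewed coordinates
print(skewed_coords((5, 2), S_B))   # element B[5, 2] in skewed coordinates
```

With the sets $S_A$ and $S_B$ of the example, the first function maps A[i, j] to
(i - j, j) and B[i, j] to (j, i).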

Fig. 5. Optimum skewed data partition and distribution for loop nest L1 on 1-D
grid

For the loop nest L1, the experimental results are shown in Table 1. In this
experimental study, we implemented two versions of the parallel code on a 32-node
nCUBE-2 multicomputer: the first, called SA, designed by our proposed skewed
alignment scheme, and the second, called CAG, designed by the CAG alignment
scheme, where $A_{(1,0)}$ is aligned with $B_{(0,1)}$ while $A_{(0,1)}$ is aligned
with $B_{(1,0)}$. In Table 1, the problem size is represented by m and the number of
processors by N. Here, the block size of the data distribution for arrays A and B
is $m/N$. The version using our proposed skewed alignment scheme runs faster than
the traditional dimension-ordered alignment scheme in every configuration measured.
The major reason is that the parallel version implemented with the skewed alignment
scheme runs on the processors in a communication-free manner for this example.



Table 1. Execution time in seconds for loop nest L1 running in parallel on a
32-node nCUBE-2 multicomputer

          N = 2          N = 4          N = 8          N = 16         N = 32
 m      SA    CAG      SA    CAG      SA    CAG      SA    CAG      SA    CAG
 2^    0.43   1.02    0.23   1.13    0.13   1.15    0.06   1.88    0.03   2.43
 2^    0.81   1.82    0.45   1.03    0.27   1.36    0.14   1.57    0.05   1.89
       1.65   3.85    0.87   2.91    0.47   2.46    0.28   2.97    0.16   3.08
       3.23   8.76    1.65   5.54    0.88   5.02    0.46   4.89    0.27   4.76
       6.43  15.87    3.23  12.11    1.65  11.89    0.87  11.03    0.46  10.55
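
The figures in Table 1 can be compared with a few lines of arithmetic. The sketch
below is only a reading aid: the helper names are hypothetical, and the rows are
listed in table order because the problem-size labels are not fully legible. It
computes how many times slower CAG is than SA in each configuration, and how SA
scales relative to its own N = 2 time.

```python
N_VALUES = [2, 4, 8, 16, 32]

# Each row holds (SA, CAG) times in seconds for N = 2, 4, 8, 16, 32, in table order.
TABLE_1 = [
    [(0.43, 1.02), (0.23, 1.13), (0.13, 1.15), (0.06, 1.88), (0.03, 2.43)],
    [(0.81, 1.82), (0.45, 1.03), (0.27, 1.36), (0.14, 1.57), (0.05, 1.89)],
    [(1.65, 3.85), (0.87, 2.91), (0.47, 2.46), (0.28, 2.97), (0.16, 3.08)],
    [(3.23, 8.76), (1.65, 5.54), (0.88, 5.02), (0.46, 4.89), (0.27, 4.76)],
    [(6.43, 15.87), (3.23, 12.11), (1.65, 11.89), (0.87, 11.03), (0.46, 10.55)],
]


def cag_over_sa(row):
    """Ratio of CAG time to SA time for each processor count."""
    return [cag / sa for sa, cag in row]


def sa_scaling(row):
    """Speedup of SA relative to its own N = 2 time."""
    base = row[0][0]
    return [base / sa for sa, _cag in row]


for row in TABLE_1:
    print([round(r, 1) for r in cag_over_sa(row)],
          [round(s, 1) for s in sa_scaling(row)])
```

The ratios grow with N because the SA times keep falling as processors are added,
while the CAG times do not shrink at the same rate.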



We can see that data arrays A and B are perfectly aligned with each other. Suppose
instead that we align arrays A and B using the traditional dimension-ordered
alignment. That is, either $A_{(1,0)}$ is aligned with $B_{(1,0)}$ while $A_{(0,1)}$
is aligned with $B_{(0,1)}$, or $A_{(1,0)}$ is aligned with $B_{(0,1)}$ while
$A_{(0,1)}$ is aligned with $B_{(1,0)}$. Either choice may incur a large amount of
communication when these data arrays are distributed among processors, whether
block or cyclic data distribution is used. This is because no perfect data alignment
exists between arrays A and B when the dimension-ordered alignment scheme is used.

Most parallelizing compilers use powerful compilation techniques to exploit loop
parallelism and data locality and thereby improve program performance. These
techniques generally include loop skewing, loop interchange, loop tiling, unimodular
transformation, and more [16]. Our skewed scheme can be combined with most of
them, such as loop skewing and loop interchange. The proposed scheme can
automatically detect and extract program parallelism as well as distribute data
arrays over processors so that the incurred communication is minimized.

6 Conclusions 

In this paper, we proposed a heuristic alignment technique for efficiently aligning
data arrays so as to minimize communication on multicomputers. We used skewed
alignment instead of the traditional dimension-ordered alignment to align data
arrays. With the skewed alignment scheme, complex programs can be handled with
less data communication than with the dimension-ordered scheme. Finally, we showed
that our proposed scheme outperforms the dimension-ordered data distribution scheme
proposed in previous work.

Abstract. The M-Machine’s combined hardware-software shared-memory
system provides significantly lower remote memory latencies than software 
DSM systems while retaining the flexibility of software DSM. This system 
is based around four hardware mechanisms for shared memory: status bits on 
individual memory blocks, hardware translation of memory addresses to 
home processors, fast detection of remote accesses, and dedicated thread 
slots for shared-memory handlers. These mechanisms have been implement- 
ed on the MAP processor, and allow remote memory references to be com- 
pleted in as little as 336 cycles at low hardware cost. 

1 Introduction 

Distributed Shared-Memory (DSM) systems use a variety of methods to implement 
shared-memory communication between processors. Some designers provide substan- 
tial hardware support for shared memory, such as hardware protocol engines [1] or ded- 
icated co-processors [10]. Other systems rely completely on software to implement 
shared memory [13]. Hardware protocol engines can give very low remote memory la- 
tencies, but increase the complexity and hardware cost of the system. In addition, they 
restrict the system to one shared-memory protocol, limiting performance on applica- 
tions whose communications patterns do not match the assumptions of the shared-mem- 
ory protocol. Software-based shared-memory is very flexible, and does not increase the 
cost of the system, but tends to have relatively poor performance due to the overheads 
imposed by conventional networks and virtual memory systems. Providing co-proces- 
sors to execute shared-memory handlers gives both flexibility and speed, but the hard- 
ware cost of this approach is substantial. 

In this paper, we present an alternative approach to implementing DSM systems, 
based around four hardware mechanisms for shared memory that are integrated into the 
processor itself, substantially improving shared-memory performance at low hardware 
cost while retaining the flexibility of software-based approaches. Shared-memory pro- 
tocols share several common features, which can be exploited in the design of hardware 
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mechanisms to accelerate shared memory. They must be able to detect references to re- 
mote memory, determine where the request for the remote memory must be sent, trans- 
fer data back to the requesting processor, and complete operations which are waiting for 
the data. In addition, it is desirable that an architecture support transfers of small blocks 
of data between processors to reduce false sharing (unnecessary remote memory oper- 
ations caused by placing two or more unrelated data objects within a block of memory 
that the shared-memory system treats as an atomic unit), and that the architecture allow 
user programs to continue executing during remote memory references. 

Based on these requirements, we have designed four hardware mechanisms for 
shared memory, which have been implemented as part of the MIT M-Machine project 
[5]. Block status bits on each eight-word block of memory allow individual blocks to 
be transferred between processors. A fast event system detects remote memory refer- 
ences and invokes software handlers in as little as 10 cycles. A Global Translation 
Lookaside Buffer caches translations between virtual addresses and their home proces- 
sors. Dedicated thread slots for software handlers eliminate context switch overhead 
when starting handlers, and allow user programs to execute in parallel with shared- 
memory handlers. 

Using all of our mechanisms allows system software to complete a remote memory 
reference in 336 cycles on the M-Machine, almost 20x faster than most software-only 
shared-memory systems, and only 2.5x slower than current-generation hardware 
shared-memory systems such as the SGI Origin 2000 [11]. On applications, we achieve 
a 9% performance improvement on a latency-bound FFT computation and a 30%
improvement on an occupancy-bound multigrid computation, as compared to the
performance of our system without these hardware mechanisms [4].

The remainder of this paper begins with a brief overview of the MIT M-Machine, 
followed by a description of the mechanisms and their use in implementing shared 
memory. We continue with an analysis of the performance impact of our mechanisms 
for shared memory, followed by a discussion of related work and some future research 
directions. 

2 The M-Machine 

The M-Machine Multicomputer is an experimental multicomputer that we have de- 
signed at M.I.T. and Stanford to explore architectural techniques to take advantage of 
improvements in silicon fabrication technology. An M-Machine consists of a two-di- 
mensional array of processing nodes, each of which consists of a custom Multi-ALU 
processor (MAP) and five SDRAM chips. The MAP chip has been fabricated in a 0.5- 
micron process, and work is ongoing as of July, 2000 on a prototype M-Machine. 

As shown in Figure 1, each MAP chip contains three processor clusters [8], two 
cache memory banks, and a network subsystem. Each of the clusters acts as an inde- 
pendent, multithreaded processor. The instruction issue logic in each cluster imple- 
ments zero-overhead multithreading between the five active threads in the cluster, se- 
lecting an instruction to issue each cycle based on operand and resource availability. 
Threads running in the same thread slot (hardware registers which hold program state) 
on each of the clusters are assumed to be part of the same job, and may use the cluster
switch to write into each other’s register files. Memory addresses are interleaved
between the two cache banks, allowing two memory operations to be completed each
cycle.




Figure 1: MAP Chip Block Diagram 

The MAP chip’s architecture allows it to exploit parallelism at multiple granulari-
ties. Each cluster acts as a 2- or 3-wide LIW processor¹, allowing fine-grained instruc-
tion-level parallelism to be exploited within a cluster. The MAP chip’s inter-cluster
communication mechanisms [9] provide low-latency communication between threads
running on different clusters, making it feasible to exploit medium-grained parallelism
within a MAP chip. Finally, the on-chip network hardware [12] provides fast, user-level
messaging between processors in an M-Machine, allowing coarser-grained parallelism
to be exploited across multiple processors.

Shared memory is implemented on the M-Machine using a combination of hard- 
ware and software. Hardware detects remote memory operations and passes them to 
software to resolve. Remote memory references are enqueued in software while their 
data is being obtained, and are then resolved by the shared-memory handlers. Complet- 
ing pending operations in software is made easier by the MAP chip’s configuration 
space, a mechanism which maps all of the register state of the chip into an address space 
that can be accessed using normal load and store operations, relying on the Guarded 
Pointers [3] protection scheme to prevent unauthorized programs from modifying other 
program’s register states. This allows the system software to write the result of each 
pending operation directly into its original destination register, making software reso- 
lution of remote memory requests transparent to user programs. 



1. The original design called for each cluster to have 3 ALUs: one integer, one memory, 
and one floating-point. Chip-space constraints forced the removal of the FP ALUs from 
two of the clusters during implementation.

3 Mechanisms for Shared Memory

We have designed four hardware mechanisms for software shared memory: block status 
bits, a fast event system, a global translation lookaside buffer, and dedicated thread slots 
for shared-memory handlers. Together, these mechanisms implement several important 
sub-tasks of most shared-memory protocols in hardware, freeing the software handlers 
to implement higher-level DSM policies. Block status bits allow 8-word blocks of data 
to be transferred between processors, reducing false sharing. The event system detects 
remote references and invokes software handlers to resolve them. The Global Transla- 
tion Lookaside Buffer provides a flexible mapping of addresses to home processors, al- 
lowing data to be mapped for maximum locality. Dedicated thread slots for software 
handlers eliminate context switch overheads when invoking software handlers, reduc- 
ing the remote reference time. 

3.1 Block Status Bits 

The block status bits allow small blocks of data to be transferred between processors by 
associating two bits of state with each 8-word block of data on a node. These bits encode 
the states invalid, read-only, read-write, and dirty, and the hardware enforces the per- 
missions represented by these states. Block status bits eliminate much of the false shar- 
ing which occurs in software-only shared memory systems because conventional virtu- 
al memory systems are unable to record presence information on units of data smaller 
than a page. They are stored in the page table and copied into the local TLB (LTLB) 
and the cache when a block is referenced, allowing remote data to be stored at all levels 
in the memory hierarchy. 

During a memory reference, the memory system checks the block status bits of the 
referenced address to determine if the operation is allowed. This check is done in par- 
allel with hit/miss determination in the cache or LTLB and therefore does not impact 
memory latency. If the block status bits allow the operation, it is completed in hardware. 
Otherwise, the event system starts a software handler to resolve the operation. Imple- 
menting block status bits requires 1KB of SRAM in the LTLB, and 0.25 KB of SRAM 
in the caches. 
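
The permission check described above can be modelled in a few lines. This is a
behavioural sketch only: the numeric encoding of the four states, the dictionary
used as a status table, and the function names are assumptions, and the real check
is performed by the memory system in hardware, in parallel with the cache or LTLB
lookup.

```python
from enum import IntEnum

WORDS_PER_BLOCK = 8  # block status bits cover 8-word blocks


class BlockStatus(IntEnum):
    # The four states fit in two bits; this particular encoding is an assumption.
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2
    DIRTY = 3


def block_of(word_address):
    """Index of the 8-word block containing a word address."""
    return word_address // WORDS_PER_BLOCK


def access(status_table, word_address, is_store):
    """Return 'hardware' if the status bits permit the access, else 'event'."""
    status = status_table.get(block_of(word_address), BlockStatus.INVALID)
    if status == BlockStatus.INVALID:
        return "event"        # block not present: software must fetch it
    if is_store and status == BlockStatus.READ_ONLY:
        return "event"        # write permission is missing
    return "hardware"


table = {0: BlockStatus.READ_ONLY, 1: BlockStatus.READ_WRITE}
print(access(table, 3, is_store=False), access(table, 3, is_store=True))
```

Because permissions are tracked per block rather than per page, two unrelated
objects in different blocks of the same page do not interfere with each other.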

3.2 The Event System 

The event system is responsible for invoking software handlers in response to remote 
memory accesses and other events which require intervention by system software. 
When the hardware detects a situation which requires software intervention, such as a 
remote memory reference, it places an event record describing the situation in a 128- 
word hardware event queue. It then discards the original operation, allowing programs 
to continue to execute while the event is resolved. A dedicated event handler thread 
processes the event records and resolves events. 

The head of the event queue is mapped onto a register in the event handler’s register
file. This speeds up queue accesses and provides a low-overhead mechanism for block-
ing the event handler when there are no pending events. If the event queue is empty, the
register for the head of the queue is marked empty by the scoreboard logic, preventing
instructions that read the register from issuing. When the event handler tries to read the
head of the event queue to begin processing the next event, the instruction stalls while
the queue is empty, but issues as soon as an event record is placed in the queue, allowing
the event handler to respond quickly to events without consuming execution cycles that
could be used by other threads in polling.

3.3 Dedicated Thread Slots for Shared-Memory Handlers 

The M-Machine takes advantage of the MAP chip’s multithreaded architecture to elim- 
inate context switch overhead from software handlers by dedicating a set of thread slots 
to software handler threads. Since the handler threads are always resident in their thread 
slots, there is no need to perform a context switch when a handler executes, in contrast 
to single-threaded implementations of software shared memory.

Figure 2: Thread Slot Assignments on the MAP Chip



Figure 2 shows the thread slot assignments on the MAP chip when shared-memory 
programs are being run. Threads which are involved in implementing shared memory 
are shown in italics, while thread slots which include special hardware are underlined. 
The event and message handlers process the events which occur when a remote memory 
reference is made as well as the various messages which are used to implement the 
shared-memory protocol. In addition, two thread slots are used for proxy threads, which 
are used to break potential deadlock situations that occur when the shared-memory sys- 
tem needs to send three sequential messages over the MAP chip’s two network priori- 
ties. 

Allocating thread slots to software handlers significantly improves the M-Ma- 
chine’s remote access time, and simplifies the design of the software handlers because 
a handler is never interrupted to allow another handler to execute. The main incremental 
cost of dedicating thread slots to handlers is the 1.25KB of storage required to hold the 
register files of the dedicated threads, since the substantial hardware complexity in- 
curred by multithreading is required by the MAP chip’s base architecture. 

3.4 The Global Translation Lookaside Buffer 

Determining how data will be mapped across the processors in a DSM is an important
part of program implementation. Mapping data for maximum locality can substantially
reduce the number of remote references made by a program, and thus improve perform-
ance. However, providing flexibility in address mapping increases the latency of soft-
ware shared memory, as the shared-memory handlers must translate each remote refer-
ence to find the home processor of the reference, creating another area where a small
amount of hardware support can significantly improve performance.

The Global Translation Lookaside Buffer (GTLB) acts as a cache for translations 
between virtual addresses and their home processors, similar to the way a normal TLB 
caches translations between virtual and physical addresses. The format of a GTLB entry 
(Figure 3) allows each entry to map a variable-sized group of pages across variable- 
sized regions of the machine. The data mapped by a GTLB entry is specified by a base 
address and a size field, which specifies the number of pages in the page group. The re- 
gion of the machine that the page group is mapped across is specified by its start node, 
the X- and Y-extents of the region (in the 2-D network), and the number of contiguous 
pages mapped per node. All fields except the base address and the start node are loga- 
rithmically encoded to reduce space. 
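
To make the entry format concrete, the following sketch translates an address to a
home node from the fields listed above. Several details are assumptions made for
illustration and are not taken from the paper: the page size constant, the row-major
and cyclic placement of page runs across the region, and all names. Fields are held
as plain integers here rather than in their logarithmic encoding.

```python
from dataclasses import dataclass

PAGE_WORDS = 512  # assumed page size in words; the text does not state it


@dataclass
class GtlbEntry:
    base: int            # first word address of the page group
    size: int            # number of pages in the page group
    start: tuple         # (x, y) coordinates of the start node
    x_extent: int        # width of the region in nodes
    y_extent: int        # height of the region in nodes
    pages_per_node: int  # contiguous pages mapped per node


def home_node(entry, address):
    """Return the (x, y) home node of address, or None if the entry does not map it."""
    page = (address - entry.base) // PAGE_WORDS
    if not 0 <= page < entry.size:
        return None
    # Assumed placement: runs of pages_per_node pages go to nodes in row-major
    # order and wrap around once every node in the region has received a run.
    slot = (page // entry.pages_per_node) % (entry.x_extent * entry.y_extent)
    dx, dy = slot % entry.x_extent, slot // entry.x_extent
    return (entry.start[0] + dx, entry.start[1] + dy)


entry = GtlbEntry(base=0, size=16, start=(0, 0), x_extent=2, y_extent=2,
                  pages_per_node=1)
print([home_node(entry, page * PAGE_WORDS) for page in range(4)])
```

Under this assumed placement, setting `pages_per_node` to 1, 2, or 4 gives three
different layouts of 16 pages on a 2x2 region, and changing `start` shifts the whole
region across the machine.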



Figure 3: GTLB Entry Format (fields: Base Address, Size, Start Node, X Extent,
Y Extent, Pages Per Node)

The GTLB allows substantial flexibility in mapping addresses. For example, Figure
4 shows the three ways in which 16 pages of data can be mapped across a 2x2 block of
processors. Note that changing the value of the start field allows the pages to be trans-
lated across the machine, facilitating space sharing of a multiprocessor.

The GTLB’s entry format allows it to be implemented in very little hardware. The 
MAP chip implements a 4-entry GTLB due to space constraints (a 16-entry GTLB was 
specified in the initial design), which requires 64 bytes of content-addressable memory. 
For the experiments run for this paper, only two GTLB entries were required — one to 
map the code segment locally on each processor, and one to map the data segment 
across the entire machine.

Figure 4: GTLB Mappings

4 Using the Hardware Mechanisms to Implement
Shared Memory

Figure 5 shows the steps involved in completing a remote memory reference on the M- 
Machine when all of our hardware mechanisms are used. On cycle 1, a user program 
issues a load or store which references a remote address. By cycle 10, the event system 
has determined that the referenced block is remote and has started the event handler to 
resolve the event. By cycle 33, the event handler has decoded the type of the event and 
jumped to the correct routine to resolve it. 

On cycle 49, the event handler completes the computation of the configuration 
space address which will be used to resolve the original load or store, and executes a 
GPRB operation to probe the GTLB for the home processor of the requested address. 

In cycles 63-111, the event handler creates a record describing the remote operation 
and enqueues it in the software pending operation structure that records all remote 
memory references being completed in software. It then sends the request message to 
the home node and terminates. The request message must be sent after the pending op-
eration data structure has been unlocked to avoid a potential deadlock with the reply
handler.

Figure 5: Remote Request Timeline

The request message arrives at the home processor on cycle 116. The next 50 cycles 
are spent locking data structures, to ensure that no other thread modifies the state of the 
referenced block while the request handler is executing. On cycle 186, the eviction of 
the home processor’s copy of the block begins, which completes on cycle 237. The re- 
ply message containing the requested block is sent on cycle 241. 

On cycle 246, the reply message arrives at the requesting processor. Block installa- 
tion begins on cycle 266. On cycle 315, the installation completes, and the reply handler 
begins to resolve the load or store that caused the remote reference, completing the orig-
inal operation on cycle 336.

5 Evaluation

A three-hop cache-coherence protocol was implemented in software on the MSIM sim- 
ulator to evaluate our mechanisms for shared memory. MSIM is a C-language model of 
the M-Machine that gives execution times within 10% of those given by the cycle-
accurate RTL model of the MAP chip on our verification test suite. The model of the MAP
chip used for these experiments has a 128-entry, two-way set-associative LTLB as well 
as floating-point units in all three clusters, restoring features that were removed late in 
the implementation process due to area constraints. The shared-memory handlers use 
the floating-point registers as temporary storage and perform constant generation in the 
floating-point units, making them relevant for this study. 

The handlers used for these experiments were implemented in hand-coded assem- 
bly language. Four versions of the handlers were written, one which uses all of the hard- 
ware mechanisms, one which only uses the block status bits, one which uses the block 
status bits and the GTLB, and one which uses the block status bits and the dedicated 
thread slots for software handlers. Versions which did not use the GTLB included a 
software address translation routine, while versions which did not use the thread slots 
simulated context switches when starting or exiting handlers. 

5.1 Remote Access Times 

Figure 6 shows the M-Machine’s remote access time as a function of the set of mecha-
nisms used. The column labelled “Full M-Machine Mechanisms” shows the remote ac-
cess time when all of the hardware mechanisms are in use, measured from the cycle on
which the processor issues a load to the cycle on which an instruction which uses the 
result of the load issues. Proceeding to the left, the columns show the remote access 
time if various subsets of the mechanisms for shared-memory are used. The leftmost 
column shows an estimate of the remote access time if none of the MAP chip’s mech- 
anisms are used, while the rightmost column shows the remote access time of an 8-proc- 
essor SGI Origin 2000 [11] for comparison purposes. All of the columns which show 
measured data from the M-Machine have been subdivided into the time spent in the 
event handler on the requesting processor, the request handler on the home processor, 
and the reply handler on the requesting processor. 

Based on block transfer programs written for the M-Machine, we estimate that us- 
ing the block status bits reduces remote access time by approximately 1000 cycles when 
only one block of a page is required. This estimate was used to generate the leftmost
column of Figure 6. If a program requires fine-grained sharing of data, this latency re-
duction can significantly improve performance. For programs with more coarse-
grained data sharing, using software to implement shared memory allows the block size 
of the protocol to be increased to match the needs of the application. 

If the GTLB is added to the block status bits, remote access time is reduced to 427
cycles, an improvement of 18%. Using the block status bits and the dedicated thread
slots for software handlers has similar results, reducing the remote access time to 433
cycles (17% improvement). Combining the block status bits, the GTLB, and the dedi-
cated thread slots for shared-memory handlers so that all of the M-Machine’s hardware
mechanisms are in use gives a remote access time of 336 cycles, a 35% improvement
over the block status bits alone.

Figure 6: Remote Access Times
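
The cycle counts quoted in Section 4 and in this subsection can be cross-checked
with simple arithmetic. The sketch below is a reading aid, not part of the original
evaluation: the phase names and helper functions are hypothetical, and the baseline
latency with block status bits alone is not stated in the text, so it is inferred
here from each quoted percentage.

```python
# Cycle numbers quoted for the remote request timeline in Section 4.
MILESTONES = {
    "issue": 1,
    "request_arrives_home": 116,
    "reply_sent": 241,
    "reply_arrives": 246,
    "operation_complete": 336,
}


def phase_lengths(m):
    """Cycles spent between consecutive milestones of a remote reference."""
    return {
        "requester_until_request_arrives": m["request_arrives_home"] - m["issue"],
        "home_processing": m["reply_sent"] - m["request_arrives_home"],
        "reply_in_network": m["reply_arrives"] - m["reply_sent"],
        "reply_handling": m["operation_complete"] - m["reply_arrives"],
    }


def implied_baseline(cycles, improvement):
    """Latency with block status bits alone implied by a quoted improvement."""
    return cycles / (1.0 - improvement)


print(phase_lengths(MILESTONES))
for label, cycles, gain in [("GTLB", 427, 0.18),
                            ("thread slots", 433, 0.17),
                            ("all mechanisms", 336, 0.35)]:
    print(label, round(implied_baseline(cycles, gain)))
```

The three quoted percentages imply baselines within a few cycles of one another,
which suggests they were all computed against the same measurement and rounded.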

The mechanisms affect the execution time of different handlers in the shared-mem- 
ory protocol, which has a significant impact on program performance, as will be shown 
later. When the GTLB is used, only the execution time of the request handler changes, 
since the determination of the requested address’ home processor is only done once, in 
the request handler. On the other hand, using the dedicated thread slots affects the exe- 
cution time of both the event and the request handlers, as it eliminates context switch 
overhead from all of the handlers. The execution times for the reply handlers shown on 
this graph do not change when the dedicated thread slots are used because execution 
time is measured from the point at which the first instruction of a handler executes and 
the context switch at the end of the reply handler does not affect the total latency. 

Figure 7 shows how the remote access time changes when one or more invalida- 
tions are required to complete a remote reference. In the protocol implemented for these 
experiments, the home processor does not send an invalidation message to itself when 
it needs to invalidate its copy of a block, so the points shown on the graph represent
2-, 4-, and 8-way sharing of data. All versions of the protocol see an increase in remote
access time of approximately 50% when an invalidation is required to complete a re- 
mote memory reference. As the number of invalidations required to complete the refer- 
ence increases, all of the protocols see a linear increase in remote memory latency, be- 
cause the bottleneck is the time required to process the acknowledgement message from 
each invalidation on the requesting processor. 

Use of the GTLB produces a constant improvement in remote access time, independent
of the number of invalidations, while the dedicated thread slots give an improvement
which increases with the number of invalidations. Again, this is due to the
fact that the determination of the home processor of an address is only done once. In
contrast, the use of dedicated thread slots eliminates the context switch overhead from
each handler. As the number of invalidations increases, the number of handlers in the
critical path of the remote request increases, and therefore the performance impact of
the dedicated thread slots increases with the number of invalidations.

Figure 7: Remote Access Time With Invalidations

5.2 Program Results 

Two programs were simulated to evaluate the impact of the mechanisms for shared 
memory on program execution time: a 1,024-point FFT, and an 8x8x8 multigrid com- 
putation. These programs were based on source code provided by the Alewife group at 
M.I.T. [2], and were written in C with annotations for parallelism. The measurements 
presented here show the execution time of the parallel kernel of each application. 

These programs display very different shared-memory characteristics. In FFT, re- 
mote memory references are relatively infrequent and are fairly evenly distributed 
across the processors. Multigrid makes many more remote references, and spends much 
more time waiting for memory than FFT. More importantly, multigrid’s remote mem- 
ory references are poorly distributed across the processors, because the destination ma- 
trix fits in a single page of memory and is thus mapped onto only one processor. Due to 
this difference in memory access patterns, the shared-memory performance of FFT is 
dominated by the latency of the shared-memory handlers, while the performance of 
multigrid is dominated by the occupancy of the request (priority 0 message) handler slot 
on the hot-spot processor. 

Figure 8 shows the execution time of a 1,024-point FFT on the M-Machine. As 
would be expected from the low demands that this program places on the shared-mem- 
ory system, overall performance is good, achieving better than 4 times speedup on eight 
processors even when only the block status bits are used. Adding either the GTLB or 
the dedicated thread slots for software handlers to the block status bits improves
performance by 2-5%, depending on the number of processors. Adding both of these
mechanisms gives a speedup almost equal to the sum of the speedups provided by each
mechanism independently, reducing execution time by up to 9%.

Figure 9 shows the execution time of an 8x8x8 multigrid computation. When just
the block status bits and the GTLB are used, execution time is up to 20% greater than
when all of the M-Machine’s mechanisms are in use. Using only the block status bits
increases execution time by up to 30% over the full-mechanism case.

Figure 8: 1,024-Point FFT

Figure 9: 8x8x8 Multigrid

Interestingly, disabling the GTLB so that only the block status bits and dedicated 
thread slots are used does not significantly increase the execution time of multigrid. In 
fact, performing address translation in software gives a 1% performance improvement 
over using all of the hardware mechanisms when the program is run on eight processors. 
This counter-intuitive behaviour occurs because the GTLB reduces the execution time 
of the event handler on the requesting processor, while the dominant factor in
multigrid’s performance is the occupancy of the request handler thread on the hot-spot
processor, which is not affected by the use of the GTLB. In fact, using the GTLB increases
the rate at which requests arrive at the hot-spot processor, increasing the number of
requests which must be returned to the requesting processor because the block they
request is in the process of being invalidated. These requests must be retried later,
increasing the occupancy of the request handler on the hot-spot processor and decreasing
overall performance.

6 Related Work

A number of systems have explored different hardware/software tradeoffs in imple- 
menting distributed shared memory. IVY [13] implements shared memory in software 
through user-level extensions to the virtual memory system. Shasta [16] and Blizzard- 
S [17] rely on the compiler to insert check code before each memory reference to deter- 
mine whether the referenced data is local or remote. Blizzard-E [17] takes a different 
approach, modifying the error correction code (ECC) bits on blocks of data to force 
traps to software when remote blocks are referenced. 

These systems show the advantages of the MAP chip’s base architecture over com- 
modity processors in implementing software shared memory. Blizzard-S and Blizzard- 
E have remote memory latencies of approximately 6000 cycles, while Shasta achieves 
latencies as low as 4200 cycles. In contrast, the M-Machine’s remote memory latency 
is estimated to be approximately 1500 cycles when the event system is the only mech- 
anism in use, giving it a 2.8-4x speed advantage over these software-only systems. This 
speed advantage is due to the MAP chip’s event system and integrated network hard- 
ware, which reduce the time to invoke software handlers and inter-processor communi- 
cation delay. Adding the block status bits, GTLB, and dedicated thread slots for soft- 
ware handlers to the base MAP architecture reduces the remote access time to 336 cy- 
cles, increasing the M-Machine’s advantage to 12.5x-17.8x. 

The Typhoon [14][15] and FLASH [10][6][7] projects explored the use of a dedicated
co-processor to implement shared memory. Typhoon used a commodity processor
to execute the shared-memory handlers, while FLASH relied on a custom MAGIC
chip. Typhoon-0, the least-integrated of the systems studied in the Typhoon project,
was implemented using FPGA technology, and achieved a remote memory latency of
1461 cycles. Two more-integrated versions of the Typhoon architecture, Typhoon-1
and Typhoon, were studied in simulation, and had remote memory latencies of 807 and
401 cycles respectively. With all of its mechanisms in use, the M-Machine has better
than a 4x advantage in remote memory latency over Typhoon-0, and a 19% advantage
over the full Typhoon system. Most of this advantage comes from the M-Machine’s
superior network subsystem and low-latency event system.
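
The latency ratios quoted in the comparisons above follow directly from the cycle counts given in the text. The short Python sketch below recomputes them. The constant names, the dictionary layout and the helper function are illustrative choices and not part of the original work, and the software-only figures are the approximate values quoted above.

```python
# Remote memory latencies in cycles, as quoted in the text.
M_MACHINE_EVENT_ONLY = 1500   # event system only (an estimate in the text)
M_MACHINE_ALL = 336           # all mechanisms in use

OTHER_SYSTEMS = {
    "Blizzard-S/E": 6000,     # approximate
    "Shasta": 4200,           # lowest latency quoted
    "Typhoon-0": 1461,
    "Typhoon-1": 807,
    "Typhoon": 401,
}


def advantage(other_cycles, own_cycles):
    """Return how many times lower own_cycles is than other_cycles."""
    return other_cycles / own_cycles


if __name__ == "__main__":
    for name, cycles in OTHER_SYSTEMS.items():
        ratio = advantage(cycles, M_MACHINE_ALL)
        print(f"{name:13s} {ratio:6.2f}x the latency of the full M-Machine")
```

Dividing by M_MACHINE_EVENT_ONLY instead of M_MACHINE_ALL gives the comparison for the base MAP architecture.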

FLASH is able to complete a remote memory access in 111-145 cycles, depending
on whether the remote data is cached. The ISA of the MAGIC chip is a major contributor
to FLASH’s low memory latency, as it contains many non-standard instructions to
accelerate shared-memory protocols. In [6], the authors report that at least 38% of
protocol processor issue slots contain one of these non-standard instructions, suggesting
that the latency of shared-memory handlers running on MAGIC is significantly reduced
by the addition of these instructions.

Comparing the M-Machine to these other systems shows the advantages of integrat- 
ing hardware support for software shared memory into the processor. The M-Machine 
achieves significantly better remote memory latencies than the software-only shared 
memory systems by performing common tasks in hardware. In addition, the M-Ma- 
chine has better remote memory latencies than any of the Typhoon systems, in spite of 
the fact that they utilize substantial custom hardware and a commodity co-processor to 
implement shared memory. While FLASH’s remote memory latencies are significantly
better than the M-Machine’s, much of this is due to the MAGIC chip’s optimized
instruction set, creating an opportunity for future work which combines hardware support
for software shared memory with an optimized instruction set.

7 Conclusion 

In this paper, we have shown that adding a small set of hardware mechanisms to support 
software shared memory to a processor can significantly improve remote memory la- 
tency at low hardware cost. We have implemented four key mechanisms for software 
shared memory: block status bits to allow small blocks of data to be transferred between 
processors, a fast event system to detect remote memory accesses and invoke software 
handlers, dedicated thread slots to eliminate context switch overhead when starting han- 
dlers, and a global translation lookaside buffer to determine the home processors of re- 
mote addresses in hardware. In combination, these mechanisms allow remote memory 
accesses to be performed in as little as 336 cycles, significantly faster than most soft- 
ware-only or combined hardware/software shared memory systems. Hardware cost of 
these mechanisms is small — approximately 3.5KB of storage, and some control logic. 

Program-level experiments showed that the impact of our mechanisms on program 
execution time depends strongly on whether the dominant factor in the program’s per- 
formance was the latency or the occupancy of the shared-memory handlers. On a 1,024- 
point FFT, in which latency was the dominant factor, using all of the mechanisms im- 
proved performance by up to 9% when compared to using only the block status bits. On 
an 8x8x8 multigrid, which is dominated by the occupancy rate of the request handler on 
the hot-spot processor, the mechanisms improved execution time by up to 30%. 

The mechanisms implemented on the MAP chip substantially improve the M-Machine’s
shared-memory performance at a low hardware cost. However, the M-Machine’s
remote access time is still more than 2.5x greater than that of contemporary
full-hardware shared-memory systems, suggesting that additional hardware
support is required to close the performance gap between hardware- and software-based
shared-memory systems.

Abstract. The page aggregation technique consists of using a granularity
unit larger than a page in a page-based DSM system. This paper presents
an initial evaluation of the influence of the page aggregation technique
on the speedup of a DSM system, applying it to two DSMs: JIAJIA and
Nautilus. TreadMarks, a DSM well known in the scientific community, is
also included in this comparison as a reference for optimal speedups.
Four granularity sizes are considered in this study: 4kB, 8kB, 16kB and
32kB. The benchmarks evaluated are SOR (from Rice University), LU and
Water N-Squared (both from SPLASH-II). The first results show that this
technique can improve JIAJIA’s speedup by up to 4.1% and Nautilus’s
speedup by up to 37.7%.



1 Introduction 

In recent years, several factors have contributed to making the network of
workstations (NOW) the most widely used kind of parallel computer:

1. the evolution of microprocessors;

2. the decreasing cost of interconnection technologies;

3. the adoption of off-the-shelf hardware components.

Large projects such as Beowulf[11] exemplify these trends.

The Distributed Shared Memory (DSM) paradigm[8], which has been widely
discussed for the last 9 years, is an abstraction of shared memory that allows
a network of workstations to be viewed as a shared memory parallel computer. By
moving or replicating data[8], the different nodes perform uniform shared memory
accesses, which is the main aim of DSM. These movements and/or replications
of data are carried out so as to guarantee consistency, allowing programs written
for physically shared memory machines to be easily ported and developed[1], since
developing message passing programs is more difficult than developing shared
memory programs. Research in the DSM area can be summarized mainly as the
development and evolution of a large number of consistency models and DSM
systems. Carter [1] has classified the evolution of DSM into two generations:

- a large number of consistency messages and the adoption of the sequential
consistency model; Ivy [6] is an example;

- a drastic reduction of the number of consistency messages through the adoption
of the release consistency model, applying techniques to reduce false sharing;
several examples can be mentioned: Munin[2], Quarks[7], TreadMarks[3][19],
CVM[10], Midway[9], JIAJIA[4] and Nautilus[5].

In terms of granularity, page-grained DSM approaches were chosen over
fine-grained ones in most cases. The study of Iftode[17] showed that, for
several applications from SPLASH-II, page-grained DSMs perform similarly to or
better than fine-grained ones, although in general higher bandwidth and message
handling costs favor page-based DSMs while lower latency favors the fine-grained
approach[17].

In page-based DSM systems, shared memory accesses are detected using
virtual memory protection; thus a page is the unit of access detection and
can also be used as the unit of transfer. Depending on the memory consistency
model and the situation, diffs^ are also used as a unit of transfer. For example,
in a homeless lazy release consistency (LRC) system such as TreadMarks, diffs are
fetched from several nodes when an invalid page is accessed, if the node has a
dirty page. On the other hand, in a home-based DSM like JIAJIA, pages are fetched
from the home nodes when a remote page fault occurs.

The unit of access detection and the unit of transfer can be increased by using
a multiple of the hardware page size. Aggregation reduces the number of messages
exchanged: if a processor accesses several pages successively, a single page
fault request and reply can be enough, instead of the multiple exchanges that
are usually required. A secondary benefit is the reduction of the number of
page faults. On the other hand, aggregation increases false sharing, which can
increase the amount of data exchanged and the number of messages[16].
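
The effect on the number of faults can be illustrated with a small Python sketch. It is a deliberately simplified model, not code from JIAJIA or Nautilus: it counts only first-touch faults for a single processor, ignores invalidations and false sharing, and the function and variable names are hypothetical.

```python
PAGE_SIZE = 4096  # hardware page size in bytes (4kB, as in the text)


def count_faults(addresses, unit_size):
    """Count first-touch faults when access detection uses unit_size bytes."""
    fetched = set()
    faults = 0
    for addr in addresses:
        unit = addr // unit_size      # index of the aggregated unit
        if unit not in fetched:       # first touch: one request/reply pair
            fetched.add(unit)
            faults += 1
    return faults


if __name__ == "__main__":
    # One processor touching 16 consecutive hardware pages, one word per page.
    trace = [page * PAGE_SIZE for page in range(16)]
    for factor in (1, 2, 4, 8):
        unit = factor * PAGE_SIZE
        print(f"{unit // 1024:2d}kB unit: {count_faults(trace, unit)} faults")
```

In this idealized sequential trace, each doubling of the unit halves the number of faults. Real applications need not behave this way, since false sharing can add messages.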

The main contribution of this paper is to evaluate the page aggregation
technique[16], for different benchmarks, on two different DSMs: JIAJIA and
Nautilus. The page aggregation technique is evaluated in both DSMs on a network
of PCs running a free operating system. In order to have a reference for optimal
speedups, TreadMarks’s speedups are included in this study. Unfortunately, the
TreadMarks version used is a demo version (1.0.3), so the source code is not
available and it was not possible to evaluate it with other grain sizes (the
default is 4kB) in order to compare it with JIAJIA and Nautilus.

The comparison is done by applying several benchmarks: LU (kernel from
SPLASH-II)[15], SOR (from Rice University) and Water N-Squared (from
SPLASH-II). The environment of the comparison is a network of 8 PCs
interconnected by shared-media Fast Ethernet. The operating system used in
each PC is Linux (2.x). This study is a preliminary evaluation of this
technique, and four aggregation sizes are used: 4kB, 8kB, 16kB and 32kB.

In section 2, a brief description of Nautilus is given. In section 3, the page
aggregation method and its consequences are explained. In sections 4 and 5,
some benchmarks are evaluated showing the application of the page aggregation
technique. Section 6 concludes this work.

^ diffs: an encoding of the modifications made to a page during a critical section.

2 Nautilus 

The main motivation for the new software DSM Nautilus is to develop a DSM
with a simple memory consistency model, in order to provide good speedups,
that is compatible with TreadMarks and JIAJIA. This idea is very similar to the
ideas used by JIAJIA, described in the studies of Hu[4] and Eskicioglu[12],
but Nautilus makes use of some other techniques, which distinguish it from
JIAJIA. These techniques are described below. In order to be portable, Nautilus
was developed as a runtime library, like TreadMarks, CVM and JIAJIA, so
there is no need to change the operating system kernel[2].

To summarize the features of Nautilus: i) scope consistency, sending
consistency messages only to the owner of the pages and invalidating pages in the
acquire primitive; ii) multiple writer protocols; iii) multi-threaded DSM; iv) no use
of SIGIO signals (which signal the arrival of a network message); v) minimization
of diff creation; vi) primitives compatible with TreadMarks, Quarks and JIAJIA;
vii) network of PCs; viii) operation under Linux 2.x; ix) UDP protocol.

Nautilus is a page-based DSM, like TreadMarks and JIAJIA. In this scheme,
pages are replicated across the nodes of the network, allowing multiple
reads and writes[8], thus improving speedups. By adopting the multiple writer
protocols proposed by Carter[2], false sharing is reduced. The coherence
mechanism adopted is write invalidation[8], because several studies[2][3][4][12]
show that this type of mechanism provides better speedups for general applications.
Nautilus uses the scope consistency model, which also reduces the false sharing
effect[14]. In Nautilus, this consistency model is implemented through a
lock-based protocol[13].

Nautilus is the first multi-threaded DSM system implemented on top of a
free Unix platform that uses the scope consistency model, because:

1) there are versions of TreadMarks implemented with threads, but they do
not use the scope consistency memory model;

2) JIAJIA is a DSM system based on scope consistency, but it is not
implemented using threads;

3) CVM[10] is a multi-threaded DSM system, but it uses lazy release
consistency and, at the moment, it does not have a Linux-based version;

4) Brazos[18] is a multi-threaded DSM and it uses scope consistency, but it is
implemented on a Windows NT platform.

Nautilus manages the shared memory using a home-based scheme, but with
a directory structure for all pages instead of a structure for only the relevant
(cached) pages, as used by JIAJIA. Nautilus also uses a memory organization
different from that of JIAJIA, explained in item 2.1.

To improve the speedup of the applications submitted, Nautilus uses two
techniques: i) multi-threaded implementation; ii) diffs of pages that were written
by the owner are not created.

The multi-threaded implementation of Nautilus permits:

1) minimization of context switches: Nautilus’s threads are used, for example,
to help reply to a request for a page or to apply a diff, and not to run a user
program, as in Brazos;

2) no use of SIGIO signals: most page-based DSM systems implemented so far
on top of a Unix platform use SIGIO signals to activate a handler that takes
care of the messages arriving from the network. In Nautilus, one of the threads
remains blocked trying to read messages from the network. While blocked, it
sleeps and thus consumes no CPU time. This technique decreases the overhead of
the DSM and leaves as much CPU time as possible to the user program. Thus,
Nautilus is the first scope consistency DSM system of the second generation
which does not use SIGIO signals in its implementation.
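
The idea of a thread that sleeps in a blocking read, rather than a SIGIO handler, can be sketched in Python as follows. This is only an illustration under stated assumptions: the message strings, the STOP shutdown message and the function names are hypothetical, and the sketch uses two UDP sockets on the loopback interface rather than a real network of nodes.

```python
import socket
import threading


def receiver(sock, handled):
    """Serve messages in a loop, sleeping inside recv() between arrivals."""
    while True:
        msg = sock.recv(4096)      # blocks: the thread uses no CPU while waiting
        if msg == b"STOP":         # hypothetical shutdown message
            break
        handled.append(msg)        # stand-in for replying to a page request


def demo():
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # UDP socket
    rx.bind(("127.0.0.1", 0))      # loopback, OS-chosen free port
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    handled = []
    worker = threading.Thread(target=receiver, args=(rx, handled), daemon=True)
    worker.start()
    for msg in (b"GETPAGE 3", b"GETPAGE 7", b"STOP"):
        tx.sendto(msg, rx.getsockname())
    worker.join(timeout=5)         # wait for the receiver to see STOP
    tx.close()
    rx.close()
    return handled


if __name__ == "__main__":
    print(demo())
```

The user program's thread is never interrupted by a signal in this arrangement. The receiver thread is simply scheduled when a datagram arrives.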

Like TreadMarks and JIAJIA, Nautilus is also concerned with network
protocols, so it also uses the UDP protocol to minimize overheads.

Nautilus also deals with the compatibility of primitives. Its primitives are 
simple and totally compatible with TreadMarks, JIAJIA and Quarks; as a result 
there is no need for code rearrangements. One example of this compatibility 
is that in this study, LU and SOR are converted from JIAJIA and SOR from 
TreadMarks, basically changing the name of the primitives. 

Like TreadMarks and JIAJIA, Nautilus is also concerned with synchronization
messages. To minimize the number of messages, the synchronization messages
carry consistency information, which reduces the number of separate consistency
messages sent.

Nautilus follows the lock-based protocol proposed by JIAJIA[12] because
of its simplicity, which minimizes overheads. In summary, in this protocol the
home node of a page always contains a valid copy of the page, and the diffs
corresponding to the remote cached copies of the pages are sent to the home nodes.
A list of the pages to be invalidated in the node is attached to the acquire lock
message.

JIAJIA[3] only keeps information about the relevant pages, i.e., the cached
copies of the pages, because it argues that this reduces the space overhead of the
system[4]. Nautilus, on the other hand, maintains a local directory structure for
all pages, since it does not occupy significant space and does not increase the
overhead of the system. On the contrary, it helps to increase the speedup of the
system.

Following the concept of JIAJIA[3][12], in Nautilus the owner nodes of the pages
do not need to send diffs to other nodes, according to the scope consistency
model. So, diffs of pages written by the owner are not created, which is believed
to be more efficient than the lazy diff creation of TreadMarks.
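
To make the notion of a diff concrete, the following Python sketch encodes the modifications made to a page by comparing it with an unmodified copy and then applies them to another copy. The text does not describe how Nautilus encodes diffs, so the byte-wise comparison against a saved copy (often called a twin in multiple-writer protocols) and all names here are assumptions made for illustration.

```python
def make_diff(twin, page):
    """Return [(offset, new_byte), ...] for the bytes that differ from the twin."""
    return [(i, new) for i, (old, new) in enumerate(zip(twin, page)) if old != new]


def apply_diff(page, diff):
    """Apply a diff in place, e.g. to the copy held by the home node."""
    for offset, value in diff:
        page[offset] = value


if __name__ == "__main__":
    twin = bytearray(8)            # copy saved before the first write
    page = bytearray(twin)         # working copy on a non-home node
    page[2] = 7
    page[5] = 9
    diff = make_diff(twin, page)
    home_copy = bytearray(8)
    apply_diff(home_copy, diff)
    print(diff, home_copy == page)
```

Under this model, a page written only by its owner needs neither a twin nor a diff, because the home copy is already the valid one.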

The state diagram of the page transitions is implemented in Unix using the
mprotect() primitive. With this primitive, pages can be put in the read-only
(RO), invalid (INV) or read-write (RW) state, so pages can have their states
changed easily.



3 Page Aggregation 

In terms of implementation, following the direction of the other DSMs, Nautilus
has a handler responsible for requesting a page from a remote node when a
segmentation fault occurs. When a page in the INV state is accessed, a
SIGSEGV signal is generated and the respective handler, as mentioned before,
requests the page from the home node. When the page arrives, the mprotect()
primitive changes the state from INV to RO.

When the page is written, another SIGSEGV signal is generated and the
mprotect() primitive changes the state of the page from RO to RW. After the
generation of the diffs, pages return to the RO state, again by means of the
mprotect() primitive. When write-notices arrive, indicating that pages were
modified by other nodes, those pages go to the INV state.
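
The page state transitions just described can be summarized as a small table-driven state machine. The Python sketch below models only the transitions named in the text. The event names are hypothetical labels, and treating a write-notice as arriving only while a page is in the RO state is an assumption of this sketch.

```python
# (current state, event) -> next state, following the description in the text.
TRANSITIONS = {
    ("INV", "access_fault"): "RO",   # SIGSEGV on an invalid page: fetch from home
    ("RO", "write_fault"): "RW",     # SIGSEGV on a write to a read-only page
    ("RW", "diff_created"): "RO",    # after the diffs are generated
    ("RO", "write_notice"): "INV",   # page was modified by another node
}


def step(state, event):
    """Return the next page state, or raise ValueError for an unmodelled pair."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event}") from None


def run(events, state="INV"):
    """Apply a sequence of events to a page that starts in the given state."""
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    print(run(["access_fault", "write_fault", "diff_created", "write_notice"]))
```

In the real system each transition corresponds to an mprotect() call that changes the protection of the page.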

The mprotect() primitive can operate on a region whose size is a multiple of
the page size, giving the same permission to the whole region. This makes it
possible to change the state of more than one page at the same time, which is
called the page aggregation technique.
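
As a rough illustration of the address arithmetic involved, the Python sketch below computes the aggregated region that contains a faulting address. The resulting start address and length are the kind of values that would be passed to mprotect(). The function name, the base parameter and the assumption that the shared region starts at a unit-aligned base are illustrative and not taken from Nautilus.

```python
PAGE_SIZE = 4096  # hardware page size in bytes


def aggregated_region(fault_addr, pages_per_unit, base=0):
    """Return (start, length) of the aggregated unit containing fault_addr.

    base is the start of the shared region, assumed aligned to the unit size.
    """
    unit_size = pages_per_unit * PAGE_SIZE
    offset = fault_addr - base
    start = base + (offset // unit_size) * unit_size
    return start, unit_size


if __name__ == "__main__":
    for pages in (1, 2, 4, 8):
        start, length = aggregated_region(0x5123, pages)
        print(f"{pages} page(s) per unit: start={start:#x}, length={length}")
```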

The study [16] says that aggregation increases false sharing and reduces the
number of messages exchanged. Also, if a processor accesses several pages
successively, a single page fault request and reply can be enough, instead of
the multiple exchanges of requests and replies that are usually required. The
study [16] also shows that there is a reduction in the number of page faults,
but that false sharing can increase the amount of data exchanged and the number
of messages.

By changing the default page size (4kB) to, for example, 8kB using the
mprotect() primitive, it is possible to evaluate the effect of the increased size
on page fault reduction and on the speedups.

4 Experimental Platform and Applications 

Here, the experimental platform and the applications are detailed. 



4.1 Experimental Platform 

The results reported here were collected on a network of 8 PCs. Each node (PC)
is equipped with an AMD K6 233 MHz processor, 64 MB of memory and a Fast
Ethernet card (100 Mbit/s). The nodes are interconnected with a hub. In order
to measure the speedups, this network was completely isolated from any
external networks. Each PC runs Red Hat Linux 6.0. The experiments are
executed with no other user processes running.

In this study, four page sizes are considered: 4kB, which is the default
(the hardware page size), 8kB, 16kB and 32kB.

4.2 Applications 

The test suite includes three programs: LU (from SPLASH-II[15]), SOR (from
Rice University) and Water N-Squared (from SPLASH-II). SPLASH-II is a
collection of parallel applications implemented to evaluate and design shared
memory multiprocessors.

The LU kernel from SPLASH-II factors a dense matrix into the product
of a lower triangular and an upper triangular matrix. The NxN matrix is divided
into an nxn array of bxb blocks (N = n*b) to exploit temporal locality on
submatrix elements. The matrix is factored as an array of blocks, allowing blocks
to be allocated contiguously and entirely in the local memory of the processors
that own them. LU has a computation/communication ratio of O(N^3)/O(N^2),
which increases with the problem size N. The nodes frequently synchronize in
each step of the computation and none of the phases is fully parallelized[4].

Water is an N-body molecular simulation program that evaluates forces and 
potentials in a system of water molecules in the liquid state using a brute force 
method with a cutoff radius. Water simulates the state of the molecules in steps. 
Both intra- and inter-molecular potentials are computed in each step. The most 
computation- and communication-intensive part of the program is the inter- 
molecular force computation phase, where each processor computes and updates 
the forces between each of its molecules and each of the n/2 following molecules 
in a wrap-around fashion[12]. 

SOR from Rice University solves partial differential equations (Laplace
equations) with an over-relaxation method. There are two arrays, a black array
and a red array, allocated in shared memory. Each element of the red array is
computed as an arithmetic mean of elements of the black array, and each element
of the black array is computed as an arithmetic mean of elements of the red
array. The program runs for a number of iterations, with two barriers in each
iteration, and communication occurs across the boundary rows at a barrier. The
communication does not increase with the number of processors, and the
communication/computation ratio decreases as the size of the problem increases[4].

5 Result Analysis 

Before presenting the results and their analysis, it is necessary to emphasize
that the execution time for number of nodes = 1 in all evaluated benchmarks
is obtained from the sequential version of the benchmark without any DSM
primitives. So, the primitive used to allocate memory when measuring the
sequential time (t(1), number of nodes = 1) is malloc(), the default allocation
primitive of C.

In order to have an accurate, homogeneous and fair comparison, the same 
programs are executed using TreadMarks (version 1.0.3), JIAJIA (version 2.1) 
and Nautilus (version 0.0.1). 

Table 1 shows some features and results of the benchmarks: sequential time
(t(1)), 8-processor parallel run time (t(8)), speedup for 8 nodes (Sp8), remote
get page request counts per node (gp) and number of local SIGSEGVs per node (SG).
The sequential time t(1) was obtained from the sequential program without DSM
primitives, as already mentioned.
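
The Sp8 values are the ratio of the sequential time to the 8-node time. The short Python sketch below recomputes the TreadMarks speedups from the times reported in Table 1. The dictionary and function names are illustrative.

```python
# Times taken from Table 1: sequential run and 8-node TreadMarks run.
T1 = {"LU": 350.90, "Water": 2983.00, "SOR": 29.10}
T8_TMK = {"LU": 55.45, "Water": 403.20, "SOR": 8.66}


def speedup(t_sequential, t_parallel):
    """Speedup of a parallel run relative to the sequential run."""
    return t_sequential / t_parallel


if __name__ == "__main__":
    for app in T1:
        print(f"{app:6s} Sp8.Tmk = {speedup(T1[app], T8_TMK[app]):.2f}")
```

The same function applied to the other t(8) rows gives the corresponding Sp8 values to within rounding.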

In table 1 and in the graphs below, several labels are used: “J4k”
means JIAJIA using a 4kB page size, “J8k” means JIAJIA using an 8kB page size,
“N16k” means Nautilus using a 16kB page size and “N32k” means Nautilus using
a 32kB page size.

app          LU        Water     SOR
t(1)         350.90    2983.00   29.10
t(8).Tmk     55.45     403.20    8.66
t(8).J4k     62.81     429.82    21.22
t(8).J8k     61.24     432.59    14.44
t(8).J16k    60.24     440.03    20.87
t(8).J32k    60.63     452.98    30.69
t(8).N4k     55.17     426.88    7.67
t(8).N8k     55.56     422.96    6.30
t(8).N16k    58.03     428.99    5.60
t(8).N32k    60.20     437.06    5.56
Sp8.Tmk      6.33      7.40      3.36
Sp8.J4k      5.59      6.94      1.37
Sp8.J8k      5.73      6.90      2.02
Sp8.J16k     5.82      6.78      1.39
Sp8.J32k     5.79      6.59      0.95
Sp8.N4k      6.36      6.99      3.79
Sp8.N8k      6.32      7.05      4.62
Sp8.N16k     6.05      6.95      5.20
Sp8.N32k     5.83      6.82      5.22
SG.J4k       87        106       112
SG.J8k       44        78        56
SG.J16k      22        63        56
SG.J32k      11        56        56
SG.N4k       7980      851       12425
SG.N8k       5029      602       7912
SG.N16k      3030      532       3990
SG.N32k      1650      398       2010
gp.J4k       3542      1921      893
gp.J8k       2220      1206      461
gp.J16k      1212      849       461
gp.J32k      663       632       461
gp.N4k       1528      445       118
gp.N8k       1232      312       72
gp.N16k      940       281       51
gp.N32k      540       195       42

Table 1. Comparison of TreadMarks, JIAJIA and Nautilus for page sizes 4kB, 8kB,
16kB and 32kB

There are some constraints with the TreadMarks version (1.0.3) used:

i) the applications were executed and the speedups measured using Nautilus
running on up to 8 nodes;

ii) bigger input sizes: the shared memory size is limited in this version;

iii) the source code of this demo version is not available, so it was possible
neither to evaluate TreadMarks with other page sizes nor to measure the
parameters gp (get page requests) and SG (local SIGSEGVs).

5.1 LU 

From table 1, applying the aggregation technique, i.e., increasing the page
size from 4kB up to 32kB, improved JIAJIA’s speedup by up to 4.1%, due to
reductions in the number of SIGSEGVs (SG.J* rows) ranging from about 39.04% to
49.4% and reductions in the number of gp (gp.J* rows) ranging from about 37.32%
to 45.4%. For the page size of 32kB, JIAJIA’s speedup is lower than for the page
size of 16kB, due to the different data distribution resulting from the
application of the page aggregation technique. For page sizes from 4kB to 16kB,
figure 1 and the Sp8.J* rows of table 1 show that JIAJIA’s speedup grows as the
page size increases.



Fig. 1. Speedups of LU: N=1792 (plot of speedup versus number of nodes)



For 8 nodes, table 1 shows a reduction of 8.33% in Nautilus’s speedup
when the page aggregation technique is applied (increasing the page size
from 4kB to 32kB). Although the number of SIGSEGVs and the number of
get page requests decrease by up to 45.4% and 42.6% respectively, as can be
observed from table 1, the use of the page aggregation technique changes the
data distribution. This new data distribution changes the home nodes, resulting
in a distribution that is not as adequate as the initial one (4kB) and decreasing
the speedups of Nautilus. The speedups of Nautilus with this technique applied
can also be observed in figure 1.

Comparing TreadMarks with JIAJIA, table 1 and figure 1 show that
TreadMarks is faster than JIAJIA by up to 13.24% for a page size of 4kB, and
by 8.76% for a page size of 16kB. In general, without the page aggregation
technique (4kB page size), the lazy release consistency model and the lower
number of diffs sent are responsible for the better speedups of TreadMarks over
JIAJIA for the LU benchmark.

Comparing Nautilus with TreadMarks for a page size of 4kB, i.e., without the
page aggregation technique, Nautilus is 0.47% faster than TreadMarks, due to
better data distribution, the avoidance of SIGIO signals and multi-threading.
When the page aggregation technique is applied, the speedup of Nautilus
decreases, and TreadMarks becomes up to 8.58% faster than Nautilus when the
latter uses pages of 32kB (N32k).

Comparing JIAJIA with Nautilus, table 1 and figure 1 show that Nautilus is
faster than JIAJIA by up to 13.77%. Although the number of SIGSEGVs of JIAJIA is
two orders of magnitude lower than that of Nautilus, as can be noticed from
table 1, the number of get page requests is up to 50.0% lower for Nautilus,
which improves data locality and therefore its speedup. Multi-threading and the
avoidance of SIGIO signals also help to improve the performance of Nautilus.



5.2 Water 



Fig. 2. Speedups of Water: 1728 molecules and 25 steps (speedup versus number of nodes)

Table 1 shows that applying the aggregation technique, i.e., increasing the page
size from 4kB to 32kB, reduces JIAJIA’s speedup by about 5.0%. The same table
shows a reduction of the number of SIGSEGVs (SG.J* rows) by up to 26.4% and a
reduction of the number of gp (gp.J* rows) by up to 37.2%. However, page
aggregation changes the data distribution, and in the new distribution the
home nodes are changed, resulting in a distribution that is less adequate than
the initial one (4kB).

For 8 nodes, table 1 shows that Nautilus’s speedup increases by up to 1.0%
when the page aggregation technique is applied (increasing the page size from
4kB to 32kB). Also, the number of SIGSEGVs and the number of get page requests
decrease by up to 29.3% and 30.6% respectively. The limiting factor for this
benchmark is its high level of synchronization, which dominates its behavior.

For Water, TreadMarks is up to 6.6% faster than JIAJIA with a page size of
4kB (J4k), and 12.29% faster than JIAJIA using the 32kB page size (J32k), since
the page aggregation technique changed the JIAJIA data distribution. In general,
comparing TreadMarks with JIAJIA without the page aggregation technique,
TreadMarks is faster than JIAJIA due to the LRC model adopted by TreadMarks
and the high number of synchronization messages of Water.

Comparing TreadMarks with Nautilus, TreadMarks is 4.96% faster than Nautilus
using a page size of 8kB (N8k), and 8.50% faster than Nautilus using the 32kB
page size (N32k). In general for Water, without the page aggregation technique
(page size of 4kB), TreadMarks is up to 5.86% faster than Nautilus due to the
lazy consistency model and the high synchronization of the Water benchmark.
Also, Nautilus’s semaphore implementation is still under development.

Comparing JIAJIA with Nautilus, table 1 and figure 2 show that Nautilus is
faster than JIAJIA by up to 1.65%. Although the number of SIGSEGVs of JIAJIA is
one order of magnitude lower than that of Nautilus, as can be noticed from table
1, the number of get page requests is up to 76.8% lower for Nautilus. The lower
number of pages transferred improves data locality and therefore its speedup.
The avoidance of SIGIO signals and multi-threading also help to improve its
speedup.

5.3 SOR 

As can be noticed from table 1, the speedups of JIAJIA are very unusual.
Therefore, JIAJIA’s speedups are not considered in the SOR analysis.

Table 1 and figure 3 show that, for the SOR benchmark, the page aggregation
technique decreased the number of SIGSEGVs by up to 49.6% and the number of
pages requested by up to 39.0%. These reductions explain the 37.7% increase in
speedup for 8 nodes for the Nautilus DSM.

Comparing Nautilus with TreadMarks without the page aggregation technique
applied, Nautilus is up to 12.80% faster. With this technique applied, Nautilus
becomes up to 55.36% faster than TreadMarks, the best known reference in the
DSM area.

Fig. 3. Speedups of SOR: 1792x1792




In general for this benchmark, without the page aggregation technique applied,
the excellent speedup of Nautilus over TreadMarks can be explained by the data
distribution (choice of the page owners) that Nautilus adopts, which improves
the matrix data locality (minimizing the number of messages through the
network) and gives a lower cold start-up time to distribute shared data. The
avoidance of SIGIO signals and multi-threading also help to improve the SOR
speedup.



6 Conclusion 



In this paper the page aggregation technique was evaluated for two different
DSMs: JIAJIA and Nautilus. The main contribution was to evaluate the application
of this technique and its effects on the speedup of these DSMs. JIAJIA,
TreadMarks and Nautilus were also evaluated and compared.

It was shown that the page aggregation technique improved Nautilus’s speedups
by up to 37.7% for the SOR benchmark and improved JIAJIA’s speedup by up to
4.61% for the LU benchmark, by reducing the number of page faults and the
number of SIGSEGVs. For Water, the improvement from the technique was up to
1.0%, due to the high synchronization and the access method of this benchmark.
In addition, the speedups of the three DSMs, JIAJIA, TreadMarks and Nautilus,
without the page aggregation technique were compared, although this was not the
main goal of the paper. In a future study, other benchmarks from SPLASH-2 and
the NAS benchmarks will be evaluated.

We also intend to acquire a complete version of TreadMarks, including its
sources, to evaluate and compare it with JIAJIA and Nautilus for several
grain sizes.








References 

1. Carter J. B., Khandekar D., Kamb L., Distributed Shared Memory: Where We Are
and Where We Should Be Headed, Computer Systems Laboratory, University of Utah,
1995.

2. Carter J. B., Efficient Distributed Shared Memory Based on Multi-protocol
Release Consistency, PhD Thesis, Rice University, Houston, Texas, September 1993.

3. Keleher P., Lazy Release Consistency for Distributed Shared Memory, PhD Thesis,
Rice University, Houston, Texas, January 1995.

4. Hu W., Shi W., Tang Z., JIAJIA: An SVM System Based on a New Cache Coherence
Protocol, Technical Report no. 980001, Center of High Performance Computing,
Institute of Computing Technology, Chinese Academy of Sciences, January 1998.

5. Marino M. D., Campos G. L., Sato L. M., An Evaluation of the Speedup of Nautilus
DSM System, published at IASTED PDCS99.

6. Li K., Shared Virtual Memory on Loosely Coupled Multiprocessors, PhD Thesis,
Yale University, 1986.

7. Swanson M., Stoller L., Carter J., Making Distributed Shared Memory Simple, Yet
Efficient, Computer Systems Laboratory, University of Utah, technical report,
1998.

8. Stumm M., Zhou S., Algorithms Implementing Distributed Shared Memory,
University of Toronto, IEEE Computer, v.23, n.5, pp. 54-64, May 1990.

9. Bershad B. N., Zekauskas M. J., Sawdon W. A., The Midway Distributed Shared
Memory System, COMPCON 1993.

10. Keleher P., The Relative Importance of Concurrent Writers and Weak Consistency
Models, in Proceedings of the 16th International Conference on Distributed
Computing Systems (ICDCS-16), pp. 91-98, May 1996.

11. Becker D., Merkey P., Beowulf: Harnessing the Power of Parallelism in a
Pile-of-PCs, Proceedings, IEEE Aerospace, 1997.

12. Eskicioglu M. S., Marsland T. A., Hu W., Shi W., Evaluation of the JIAJIA DSM
System on High Performance Computer Architectures, Proceedings of the Hawaii
International Conference on System Sciences, Maui, Hawaii, January 1999.

13. Hu W., Shi W., Tang Z., A lock-based cache coherence protocol for scope
consistency, Journal of Computer Science and Technology, 13(2):97-109, March 1998.

14. Iftode L., Singh J. P., Li K., Scope Consistency: A bridge between release
consistency and entry consistency, Proceedings of the 8th ACM Annual Symposium on
Parallel Algorithms and Architectures (SPAA’96), pp. 277-287, June 1996.

15. Woo S., Ohara M., Torrie E., Singh J. P., Gupta A., The SPLASH-2 programs:
Characterization and methodological considerations, in Proceedings of the 22nd
Annual Symposium on Computer Architecture, pp. 24-36, June 1995.

16. Amza C., Cox A. L., Dwarkadas S., Jin L. J., Rajamani K., Zwaenepoel W.,
Adaptive Protocols for Software Distributed Shared Memory, Proceedings of the
IEEE, Special Issue on Distributed Shared Memory, pp. 467-475, March 1999.

17. Iftode L., Singh J. P., Shared Virtual Memory: Progress and Challenges,
Proceedings of the IEEE, Vol. 87, No. 3, March 1999.

18. Speight E., Bennett J. K., Brazos: A third generation DSM system, in
Proceedings of the 1997 USENIX Windows/NT Workshop, pp. 95-106, August 1997.

19. Keleher P., Update Protocols and Iterative Scientific Applications, in The 12th
International Parallel Processing Symposium, March 1998.




Nanothreads vs. Fibers for the Support of Fine 
Grain Parallelism on Windows NT/2000 
Platforms 



Vasileios K. Barekas, Panagiotis E. Hadjidoukas, 

Eleftherios D. Polychronopoulos, and Theodore S. Papatheodorou 

High Performance Information Systems Laboratory 
Department of Computer Engineering and Informatics, University of Patras 
Rio 26500, Patras, Greece 
{bkb,peh,edp,tsp}@hpclab.ceid.upatras.gr
http://www.hpclab.ceid.upatras.gr



Abstract. Support for parallel programming is essential for the efficient
utilization of modern multiprocessor systems. This paper focuses on the
implementation of multithreaded runtime libraries used for the fine-grain
parallelization of applications on the Windows 2000 operating system. We have
implemented and introduce two runtime libraries. The first one is based on
standard Windows user-level fibers, while the second is based on nanothreads.
Both follow the Nanothreads Programming Model. A systematic evaluation comparing
both implementations has also been conducted at three levels: the user-level
thread packages, the runtime libraries and the application level. The results
demonstrate that nanothreads outperform the Windows fibers. The performance
gains of the thread creation and context switching mechanisms are reflected in
both runtime libraries. Experiments with fine-grain applications demonstrate up
to 40% higher speedup in the case of nanothreads compared to that of fibers.



1 Introduction 

During the last few years, there have been significant technological advances
in the area of workstations and servers. These systems are based on low-cost
multiprocessor configurations running conventional operating systems, like
Windows NT. Although the performance of these systems is comparable to that of
other more expensive small-scale Unix-based multiprocessors, the software used
is inadequate to utilize the existing hardware efficiently. Parallel processing
on these systems is at a primitive stage, due to the lack of appropriate tools
for the efficient implementation of parallel applications. The parallelization
of a sequential application requires the explicit use and knowledge of the
underlying thread architecture. Furthermore, users must detect the potential
parallelism themselves. As we show in this paper, the existing support provided
by Windows is inadequate for the efficient implementation of a wide range of
parallel applications.

On the other hand, most of the high-end multiprocessors running Unix-like
operating systems provide adequate and convenient tools for the user to build
parallel applications, which can be executed with the lowest possible overhead.
Such advanced tools consist of an automatic parallelizing compiler, an optimized
multithreading runtime library and the appropriate operating system support
[1,3]. Directives are inserted manually or automatically at compile time into
the sequential code to specify the existing parallelism [11]. The modified code
is analyzed and the directives are translated into appropriate runtime API
function calls. The final code is compiled and linked with the runtime library
to produce the parallelized executable.

In this paper, we present the implementation of two multithreaded runtime
libraries, both based on user-level threads running on Windows 2000. These
libraries are specially designed to provide the user with the necessary support
for the efficient parallelization of applications. The first library, called
FibRT (Fibers RunTime), uses the standard Windows fibers, while the second one,
called NTLib (NanoThreads Library for NT), uses nanothreads, a custom user-level
threads package. The NTLib runtime library was ported to the Windows 2000
operating system from its original implementation on the IRIX operating system
[1]. Both the FibRT and NTLib libraries are implemented according to the
Nanothreads Programming Model [12]. These libraries export the same API to the
user, providing the same functionality. For the rest of this paper, the term
runtime library will refer to both FibRT and NTLib, unless otherwise specified.
We compare both implementations in terms of runtime overhead and of the
performance gained by using them in the parallelization and execution of real
applications.

The rest of this paper is organized as follows. Section 2 provides the
necessary background. In Section 3, we present the two runtime libraries
introduced in this paper, together with the necessary details of our
implementations. The performance study and experimental results are presented
in Section 4. In Section 5, we present related work; finally, we summarize in
Section 6.



2 Background 

In this section we outline the multithreading support provided in the Windows 
2000 operating system, for both kernel-level and user-level threads, along with 
the Nanothreads Programming Model. 



2.1 Windows Multithreaded Architecture 

The Windows 2000 operating system supports multiple kernel-level threads
through a powerful thread management API [14]. These threads are the operating
system’s smallest kernel-level objects of execution, and a process may consist
of one or more threads. Each thread can create other threads that share the
same address space and system resources but have an independent execution stack
and thread-specific data. Kernel-level threads are scheduled on a system-wide
basis by the kernel in order to be executed on a processor. It is through
threads that Windows allows programmers to exploit the benefits of concurrency
and parallelism. Since these threads are kernel-level, their state resides in
the operating system kernel, which is responsible for their scheduling.

Besides kernel-level threads, Windows also provides user-level threads, called
fibers. These are the smallest user-level objects of execution. Fibers run in
the context of the threads that schedule them and are unknown to the operating
system [7]. Each thread can schedule multiple fibers. However, a fiber carries
less state information than a thread. The only state information maintained
for a fiber is its stack, a subset of its registers, and the fiber data provided
during its creation. The saved registers are the set of registers typically
preserved across a function call. A fiber is scheduled when another fiber
switches to it. The operating system still schedules threads to run. When a
thread running fibers is preempted, its currently running fiber is also
preempted. Figure 1 illustrates the overall Windows multithreaded architecture.
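The cooperative nature of user-level threads can be illustrated with a small analogy. The sketch below uses Python generators, not Win32 fibers: each generator keeps its own execution state and gives up control only at an explicit yield, and nothing preempts it from within the hosting thread. Unlike fibers, which switch directly to a chosen target fiber, the generators here return to a small round-robin loop. All names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Generator, List

Fiber = Generator[None, None, None]


def run_cooperatively(bodies: List[Callable[[], Fiber]]) -> None:
    """Run generator-based 'fibers' round-robin on the calling thread."""
    ready: Deque[Fiber] = deque(body() for body in bodies)
    while ready:
        fiber = ready.popleft()
        try:
            next(fiber)            # switch into the fiber until its next yield
        except StopIteration:
            continue               # the fiber finished, so it is dropped
        ready.append(fiber)        # the fiber yielded voluntarily, requeue it


trace: List[str] = []


def worker(name: str, steps: int) -> Callable[[], Fiber]:
    def body() -> Fiber:
        for i in range(steps):
            trace.append(f"{name}{i}")
            yield                  # explicit switch point, no preemption
    return body


run_cooperatively([worker("a", 2), worker("b", 3)])
```

After the call, trace records the interleaving produced purely by the voluntary switch points.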




Fig. 1. Threads architecture in the Windows 2000 operating system; fibers reside
completely in user space



The efficient utilization of a multiprocessor system requires multiple
kernel-level threads of control to be active at any time. The presence of more
than one processor allows the simultaneous execution of a corresponding number
of threads. Although the Windows thread API provides extensive functionality,
the overhead of kernel-level threads makes them unsuitable for fine-grain
parallelization of applications. An application that uses hundreds of
ready-to-execute threads reserves a significant part of the process address
space. Furthermore, a large number of context switches occurs, resulting in
excessive scheduling overhead.

On the other hand, fiber management occurs entirely in user space and
consequently its overhead is significantly lower than that of kernel-level
threads. The cost of suspending a kernel-level thread is an order of magnitude
higher than that of switching a fiber, which is performed in user space.
Similarly, there is an order of magnitude difference in the cost of creation
between fibers and kernel-level threads. The application programmer is
responsible for the management of fibers, such as allocating memory, scheduling
them on kernel threads and preempting them. This means that the user has to
manage the scheduling of fibers, while the scheduling of threads is controlled
entirely by the operating system. Thus, programming with fibers is more
difficult. An appropriate runtime system that provides support for programming
with fibers can take advantage of all of their potential benefits.

2.2 The Nanothreads Programming Model 

The Nanothreads Programming Model (NPM) [12] exploits multiple levels of loop
and functional parallelism. The integrated compilation and execution environment
consists of a parallelizing compiler, a multithreaded runtime library and the
appropriate operating system support. According to this model, applications are
decomposed into fine-grain tasks and executed in a dynamic multiprogrammed
environment. The parallelizing compiler analyzes the source program to produce
an intermediate representation, called the Hierarchical Task Graph (HTG). The
HTG is a directed acyclic graph with various hierarchy levels, which are
determined by the nesting of loops and correspond to different levels of
exploitable parallelism. Nodes at the same level of the hierarchy are connected
with directed arcs that represent data or control dependencies between them.
A node can have several input and output dependencies. Any arc between two
nodes implies that these nodes have to be executed sequentially. A node is
instantiated with a task that uses a user-level thread to execute its associated
code. A newly instantiated task is not necessarily ready to be executed, as its
dependencies may be unresolved. When all its dependencies have been satisfied,
the task becomes ready for execution.
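The readiness rule can be sketched on a single, flat level of such a graph. The code below is only an illustration and not the compiler's HTG: it ignores the hierarchy, counts the unresolved input arcs of every node, and releases a node once that count reaches zero. The node names and the function name are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def ready_order(nodes: Iterable[str], arcs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the order in which nodes become ready and are executed."""
    nodes = list(nodes)
    pending: Dict[str, int] = {n: 0 for n in nodes}       # unresolved inputs
    successors: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in arcs:
        pending[dst] += 1
        successors[src].append(dst)

    ready = deque(n for n in nodes if pending[n] == 0)
    order: List[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)                 # stand-in for executing the task
        for succ in successors[node]:
            pending[succ] -= 1             # one input dependency satisfied
            if pending[succ] == 0:
                ready.append(succ)         # all inputs satisfied, now ready
    return order


order = ready_order("ABCD", [("A", "C"), ("B", "C"), ("C", "D")])
```

Here A and B have no input arcs and could run in parallel, C must wait for both, and D must wait for C.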

The runtime library controls the creation of tasks, ensuring that the generated
parallelism matches the number of processors allocated to the application by
the operating system. The overhead of the runtime library is low enough to make
the management of parallelism affordable. In other words, the runtime library
implements dynamic program adaptability to the available resources by adjusting
the granularity of the generated parallelism. The runtime library environment
cooperates with the operating system, which distributes physical processors
among the running applications. The operating system provides virtual processors
to applications, as the kernel abstraction of physical processors on which
applications can execute. Virtual processors provide user-level contexts for
the execution of the tasks’ user-level threads. The main objective of the NPM
is that both application scheduling (mapping of user-level threads to virtual
processors) at the user level and virtual processor scheduling (mapping of
virtual to physical processors) at the kernel level must be tightly coordinated
in order to achieve high performance.
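One simple way to match generated parallelism to the allocated processors is to split a parallel loop into as many chunks as there are processors. The sketch below illustrates only that idea. The policy is an assumption, since the text does not specify how the granularity is adjusted, and the function name is hypothetical.

```python
from typing import List


def chunk_iterations(n_iters: int, processors: int) -> List[range]:
    """Split range(n_iters) into at most `processors` contiguous chunks."""
    if n_iters <= 0:
        return []
    parts = max(1, min(processors, n_iters))
    base, extra = divmod(n_iters, parts)
    chunks: List[range] = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)   # spread the remainder evenly
        chunks.append(range(start, start + size))
        start += size
    return chunks


# The same 10-iteration loop yields coarser tasks when fewer processors
# are allocated to the application.
tasks_on_4 = chunk_iterations(10, 4)
tasks_on_2 = chunk_iterations(10, 2)
```

Calling the function again with a new processor count is the analogue of adapting granularity when the allocation changes.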

3 Runtime Libraries Implementation 

In this section we describe the implementation of our runtime libraries. These
libraries are primarily designed to provide support for the parallel execution
of tasks at the backend of a parallelizing compiler. The compiler takes as
input either sequential code or code annotated with special directives that
indicate the presence of code that can be executed in parallel by multiple
processors. The latter is the most frequently used technique today, after the
adoption of the OpenMP standard [11]. The compiler translates the annotated code
into parallel code that takes advantage of the runtime libraries. Besides the
compiler, our runtime libraries can be used directly by a programmer for the
development of applications that use the exported API. For the rest of this
section, the term user will refer to either the parallelizing compiler or the
application programmer.

As stated before, we have implemented two multithreaded runtime libraries for
the Windows NT/2000 environment: the NTLib library, which uses custom user-level
threads named nanothreads, and the FibRT library, which uses the standard
Windows user-level fibers. The custom user-level threads used by NTLib are based
on the QuickThreads package [5], which provides functionality for non-preemptive
thread management similar to that provided for fibers by the Windows API.
Additionally, the QuickThreads package enables the passing of multiple arguments
to the thread function. Both libraries have been designed to provide the user
with a suitable interface for the exploitation of application parallelism. Their
lightweight user-level threads support fine-grain parallelism, and the
programming model used allows the exploitation of multiple levels of
parallelism.

The currently exported API implements the Nanothreads Programming Model, which
has been described in Section 2.2. Other programming models, such as fork-join,
which is required for the OpenMP standard, can be implemented easily using the
existing infrastructure. Both the NTLib and FibRT runtime libraries export the
same API to the user, providing the same functionality. This API is responsible
for task management, the handling of the ready task queues, the control of the
dependencies between tasks and the initialization of the environment. The
implementation details are described in the rest of this section. A detailed
description of the exported API can be found in [2].

Task Management. A task is the fundamental object that the runtime libraries
manage. Tasks are blocks of application code that can be executed in parallel.
The responsibility of our runtime libraries is to instantiate them using
user-level threads. Each task has to execute some work that is represented by a
user-supplied function, which can take multiple arguments. The task management
implementation differs between the two runtime libraries because they use
different user-level thread packages for the instantiation of tasks. In both
libraries, each task is represented by a compact structure, called the task
structure, which describes the work that the task will execute, its input and
output dependencies, the virtual processor on which the task runs and a pointer
to the associated user-level thread stack. In NTLib, the creation of a task
involves the creation of a nanothread, which will execute the task’s work. The
nanothread’s stack is initialized with the work information (user function and
its arguments). In order to execute the user function we just switch to the
task’s user-level thread.

In contrast, in FibRT we create a fiber for each task. In this case, the work
information resides in the task structure, and the structure’s pointer refers
to the associated fiber. This fiber is created to execute a helper function, to
which we pass as argument a pointer to the task structure. This function pushes
the user function’s arguments onto the stack and calls the user function. When
switching to the associated fiber, the helper function is executed, which in
turn executes the user function. This way the user can execute a function with
multiple arguments through a fiber, bypassing the fibers’ limitation of taking
only one argument for the user function.
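The helper-function technique can be sketched independently of the Win32 API. In the illustration below, fiber_entry plays the role of the helper: it accepts exactly one argument, a reference to the task structure, and unpacks the stored work information to call the user function with all of its arguments. The Task fields, the result slot and the axpy function are hypothetical and are not the FibRT data layout.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class Task:
    func: Callable[..., Any]          # user-supplied function
    args: Tuple[Any, ...] = ()        # its arguments, stored in the structure
    result: Any = None                # hypothetical slot for the return value


def fiber_entry(task: Task) -> None:
    """Single-argument entry point, as a fiber start routine would be."""
    task.result = task.func(*task.args)   # unpack and call the user function


def axpy(a: float, x: float, y: float) -> float:
    return a * x + y


task = Task(axpy, (2, 3, 4))
fiber_entry(task)                     # stands in for switching to the fiber
```

The user function never needs to know that it was started through a one-argument entry point.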

The exported API provides functions for the creation of tasks; the user
specifies the function, its arguments and the task’s input and output
dependencies. The task creation procedure involves the initialization of the
task structure and the associated user-level thread (nanothread or fiber). In
NTLib we must allocate space for both the task structure and the nanothread’s
stack, while in FibRT we allocate space only for the task structure and then
call the Win32 API function CreateFiber to create and initialize a new fiber.

Optionally, we can reuse already allocated space; this is another
distinguishing point between the two libraries. In NTLib, we maintain a global
queue, called the reuse queue, where we insert the nanothread stacks of already
executed tasks. When we need to create a nanothread for a new task, we extract
an already allocated task structure and nanothread stack and reinitialize them
with the new task information. If we cannot find an already executed nanothread
in this queue to reuse, we allocate space for a new task structure and
nanothread stack. In FibRT we reuse only task structures, not the fibers
themselves, due to the limitations of the fibers’ implementation. The space
allocated for each fiber is released after the termination of its execution.
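The reuse mechanism is essentially a free list. The following minimal sketch shows the idea with a hypothetical StackPool class: a finished task's stack is returned to the pool and handed out again before any new allocation is made. A bytearray stands in for a nanothread stack, and the synchronization that a shared global queue would need is omitted.

```python
from typing import List


class StackPool:
    """Free list of stacks, reused before new ones are allocated."""

    def __init__(self, stack_bytes: int) -> None:
        self.stack_bytes = stack_bytes
        self._free: List[bytearray] = []
        self.allocations = 0              # counts real allocations only

    def acquire(self) -> bytearray:
        if self._free:
            return self._free.pop()       # reuse a stack of a finished task
        self.allocations += 1
        return bytearray(self.stack_bytes)

    def release(self, stack: bytearray) -> None:
        self._free.append(stack)          # keep it for the next task


pool = StackPool(4096)
first = pool.acquire()
pool.release(first)
second = pool.acquire()                   # served from the pool
```

Only the first acquire allocates memory; the second one returns the same object.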

Queue Management. Both libraries maintain ready queues, where tasks that are
ready for execution are inserted. A task is ready to be executed when all its
input dependencies have been satisfied by its ancestors in the HTG. There is
one global queue, to which all processors have equal access, and per-processor
local queues, to which only the owner processor has access. Although this
configuration is very flexible and preserves affinity for task scheduling, we
optionally allow a processor with an empty local queue to steal work from
another processor’s local queue to maintain load balancing and better system
utilization.
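The optional stealing step can be sketched as follows. The victim selection (first non-empty queue in index order) and the choice to steal from the tail are assumptions made for illustration; the text does not describe these details. The locking that concurrent access would require is also left out.

```python
from collections import deque
from typing import Deque, List, Optional


def steal(local_queues: List[Deque[str]], thief: int) -> Optional[str]:
    """Take one task from another processor's local queue, if any has work."""
    for victim, queue in enumerate(local_queues):
        if victim != thief and queue:
            return queue.pop()        # assumed policy: steal from the tail
    return None                       # nothing to steal anywhere


# Processor 0 is idle while processor 1 still has two ready tasks.
queues: List[Deque[str]] = [deque(), deque(["t1", "t2"])]
stolen = steal(queues, thief=0)
```

Stealing from the tail leaves the head of the victim's queue, which the owner would take next, untouched.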

Dependency Management. Inside the task structure, we keep information for both
input and output dependencies. Input dependencies are represented using a
counter in the structure, while output dependencies are maintained by keeping a
pointer to each successor task. When a task finishes its execution, it satisfies
one input dependency on each of its successors. Every time a task creates a
subtask, additional care must be taken to preserve the input dependencies of the
creator task. For this reason, we increase the creator task’s input dependencies
by one and declare it as the successor of the subtask. This way, we can
maintain multiple levels of nested tasks according to the HTG structure, making
our runtime libraries capable of exploiting multiple levels of parallelism.
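This bookkeeping can be sketched with a counter and a successor list per task. The code is an illustration of the rule in the text and not the libraries' actual structure: the field and function names are hypothetical, and scheduling of the released tasks is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Task:
    name: str
    pending_inputs: int = 0                           # input dependency counter
    successors: List["Task"] = field(default_factory=list)


def add_dependency(pred: Task, succ: Task) -> None:
    pred.successors.append(succ)      # output dependency of pred
    succ.pending_inputs += 1          # input dependency of succ


def spawn_subtask(creator: Task, name: str) -> Task:
    """Create a subtask and make its creator wait for it."""
    child = Task(name)
    add_dependency(child, creator)    # creator becomes the child's successor
    return child


def finish(task: Task) -> List[Task]:
    """Satisfy one input on each successor; return those that became ready."""
    ready = []
    for succ in task.successors:
        succ.pending_inputs -= 1
        if succ.pending_inputs == 0:
            ready.append(succ)
    return ready


parent = Task("parent")
c1 = spawn_subtask(parent, "c1")
c2 = spawn_subtask(parent, "c2")
```

The parent is released only when the last of its subtasks finishes, which is what allows tasks to be nested to several levels.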
Runtime Initialization. The runtime library environment has to be initialized
before the user calls any of the library’s routines. An initialization routine
is provided for the instantiation of the environment and is the first library
function that an application executes. This routine takes as arguments some user
parameters, such as the maximum number of processors that the application will
run on and the thread scheduling policy that will be used. It initializes the
ready queue environment and creates one kernel-level thread for each processor
that the application will run on. These kernel-level threads play the role of
virtual processors, which will execute the tasks created by the application,
and are bound to specific processors. Additionally, in the FibRT library the
created kernel-level threads have to initialize their internal structures in
order to support fibers, so we call the ConvertThreadToFiber Win32 API
function.

Task Scheduling. Each virtual processor enters a task dispatching routine and
searches for ready tasks to execute, either after its creation or when the
execution of a task has finished. To maintain affinity across tasks, the next
task is selected from the set of satisfied successors of the previously run
task. If there is more than one satisfied successor, the first is selected and
the remaining ones are inserted into the local ready queue. If there are no
satisfied successors, the virtual processor searches for the next task first in
its local ready queue and then in the global ready queue. Selecting the next
task in this order maximizes the exploited task affinity. A virtual processor
that cannot find any task with the above method can search in another
processor’s local queue [13]. Using this technique, we maximize load balancing
across the executing processors.
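The selection order described above can be sketched as a single function. It is a simplified illustration: tasks are plain strings, the remaining satisfied successors are appended to the tail of the local queue (the text does not say at which end they are inserted), and the final fallback of searching another processor's queue is omitted. The function name is hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional


def pick_next(satisfied_successors: List[str],
              local_q: Deque[str],
              global_q: Deque[str]) -> Optional[str]:
    """Choose the next task: successors first, then local, then global."""
    if satisfied_successors:
        first, *rest = satisfied_successors
        local_q.extend(rest)          # keep the other successors nearby
        return first
    if local_q:
        return local_q.popleft()
    if global_q:
        return global_q.popleft()
    return None                       # a real dispatcher could now try to steal


local_q: Deque[str] = deque(["l1"])
global_q: Deque[str] = deque(["g1"])
chosen = pick_next(["s1", "s2"], local_q, global_q)
```

After this call the second successor waits in the local queue behind the task that was already there.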

4 Experimental Evaluation 

This section reports a systematic evaluation of the implementation of both
runtime libraries. In addition, we present the performance of native Windows
kernel-level threads, in order to show their inefficiency compared to user-level
threads. More specifically, the measurements were conducted at three levels: the
user-level thread packages, the runtime libraries, and manually parallelized
applications built using the two runtime libraries. In subsection 4.1, an
evaluation of the cost of the user-level thread primitives is presented.
Subsection 4.2 reports the performance of the multithreaded runtime libraries.
Finally, in subsection 4.3 we use applications parallelized with our libraries
to measure the performance delivered to the final user.

All the experiments were conducted on a Compaq Proliant 5500 4-processor
200MHz Pentium Pro system, running Windows 2000 Advanced Server and equipped
with 512 MB of main memory. Both runtime libraries were developed using the
Microsoft Visual C++ compiler. Time measurements were collected using the
Pentium Pro processor’s time-stamp counter register.

4.1 Evaluating User-Level Threads

In this section, we measure the cost of the thread primitives for both the
Windows fibers and nanothreads implementations. These primitives include thread
creation and thread context switching. Additionally, we present the cost of the
corresponding kernel-level thread primitives. These experiments were conducted
using only one processor, by setting the process affinity mask to this processor.

First, we use a simple microbenchmark to measure the creation time of
suspended kernel-level threads, fibers and nanothreads. Figure 2.a illustrates
the results for various numbers of threads. The measured times are given in
clock cycles and presented on a logarithmic scale.

The time required for the creation of a number of nanothreads is almost half 
the time needed to create the same number of fibers. This difference between 
nanothreads and fibers is due to the smaller stack which is allocated during the 
nanothreads initialization and the more independent nature of the nanothreads 
versus that of fibers, which depend on their creator kernel-level thread. During 
this experiment we didn’t use the stack reuse mechanism of the nanothreads 
package. As expected, the creation time for both user-level thread packages is 
almost an order of magnitude less than that of Windows kernel-level threads.
This is due to the heavy nature of the kernel-level threads, which require during
their creation the initialization of both their user and kernel context, and the 
internal kernel structures. 

In the second experiment, we evaluate the cost of the context switching for 
the three thread classes. For this purpose, we use a microbenchmark to create a
number of suspended kernel-level threads, fibers, and nanothreads, which execute
an empty function. For user-level threads, we measure the time it takes to execute
all threads, using peer-to-peer scheduling. In the case of kernel-level threads, the
measurement includes the time to resume the initially suspended threads until
they finish their execution. In fact, the benchmark execution time indicates the
cost of context switching to a thread that has not run before in the system.
Our measurement includes the overhead to read the new thread’s state from
memory; this means that several cache misses and even page faults can occur. In 
this case, the cost differs from the pure context switching cost, which is measured 
using only two threads, switching to each other several times. This implies that 
both thread states reside in the cache memory. Our experiment represents a more 
realistic measurement, since it refers to the context switching cost which occurs 
during the execution of real applications. Figure 2.b illustrates the measured 
time from the microbenchmark execution, for each case. 

The measured time for the nanothread case is almost an order of magnitude 
less than the corresponding time for fibers, and more than two orders of magni- 
tude less than kernel-level threads’ time. These results are very interesting and 
show that the nanothreads context switching mechanism is faster compared to 
that of fibers. Although both user-level thread packages preserve almost the same 
state between context switching, we measured a substantial difference in their 
costs. This difference is due to the more compact implementation of the nanoth- 
reads package which uses its stack to save its state registers, while the fibers use a 
separate context structure. Additionally, fibers preserve some state about excep- 
tion handling. The large difference between the two user-level threads packages 
and the kernel-level threads case is because of the full processor context that is 
preserved between consecutive kernel-level threads context switching. 

Both experiments demonstrate that nanothreads have significantly lower
creation and context switching costs than the standard Windows fibers. Their
low overhead makes them suitable for the fine-grain parallelization of applications.
Furthermore, the above experiments show that kernel-level threads are
inefficient for use in fine-grain parallel applications.

4.2 Evaluating the Runtime Libraries 

In this section, we evaluate the runtime overhead that NTLib and FibRT libraries 
impose in the creation and execution of tasks. In fact, we want to see what are 
the advantages of using the faster primitives of the nanothreads in a complex 
runtime library, against fibers. 

To measure the runtime overhead we have built a microbenchmark, which 
creates a number of tasks that run an empty function. We measure the time from
the creation of the task structures until all tasks finish. For each task, this time
interval includes the allocation and initialization of the task structure, the creation
of a user-level thread, the task’s insertion into the ready queue, and the time until
the task is selected for execution and finishes. Additionally, in order to evaluate the influence
of the nanothread’s stack reuse mechanism for the NTLib runtime library we 
measure the time with this mechanism turned off. 

The experiment was conducted using all four processors for the execution of 
the created tasks. The results for each case are illustrated in Figure 3. Addi- 
tionally, we present the time needed for the creation and execution of the same 
number of tasks instantiated using Windows kernel-level threads. 

Fig. 3. Runtime library overhead for the creation and execution of a number of
tasks (plot title: Total Runtime Overhead; horizontal axis: Number of Tasks, 4 to 2048;
vertical axis: logarithmic scale, 1.0E+04 to 1.0E+09)

The measured times show clearly that the creation and execution of tasks
in NTLib runtime library is much faster than both FibRT runtime library and
kernel-level threads, in all cases. In addition, the reuse mechanism in the NTLib 
library lowers the total runtime overhead by more than 50%. Although this 
mechanism improves only the task creation procedure, by allowing a new task 
to be created using an already allocated nanothread’s stack, its influence is 
very important in the total execution time of the microbenchmark. The runtime 
overhead of the NTLib library is more than an order of magnitude lower than 
that of FibRT library and almost two orders of magnitude less than that of 
kernel-level threads. These results were expected due to the lower costs of the
nanothreads primitives, measured in the previous section. 

4.3 Performance Evaluation of Applications 

In this section, we investigate whether the difference in performance between
the NTLib and FibRT runtime libraries is reflected in the execution of parallel
applications. We are interested in the fine-grain parallelism because of its
advantages over coarse-grain parallelism. For our experiments, we select three 
well-known applications in order to measure the performance of our runtime 
libraries. These applications have been parallelized using the runtime libraries, 
with variable granularity. The parallelization was made by hand since currently 
the source code of a parallelizing compiler is not available. Additionally, we have 
implemented these applications using Windows kernel-level threads just to illus- 
trate their unsuitability for fine-grain parallelism. These applications are CMM, 
which performs complex matrix multiplication, BlockLU, which decomposes a 
dense matrix using blocking, and Raytrace, which renders a three-dimensional 
scene. The CMM is an application that has been previously used for the eval- 
uation of the Nanothreads Model [9,13] while the other two applications come 
from the Splash-2 benchmark suite [16]. 

To examine the behavior of CMM under several levels of granularity we use 
matrices of size 192x192 with variable chunk size for the outer-most loop. In
Figure 4.a, we present the speedups measured using chunk size equal to 4 and
16, for our three implementations. For each number of processors we show six
bars; the first three correspond to the execution with chunk size 4 (fine-grain),
while the other three to the execution with chunk size 16 (coarser-grain).
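
As a rough illustration of how the chunk size sets the task granularity, the sketch below splits the 192 outer-loop iterations into chunks of 4 and 16 iterations. The helper name `outer_loop_chunks` is hypothetical and is not part of NTLib or FibRT; the sketch only assumes that each chunk of consecutive outer-loop iterations becomes one task.

```python
def outer_loop_chunks(n_rows, chunk_size):
    """Split outer-loop iterations 0..n_rows-1 into consecutive chunks.

    Each (start, end) pair is a half-open range that one task would execute.
    Assumption: one task per chunk, as a stand-in for the runtime libraries.
    """
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]


fine = outer_loop_chunks(192, 4)     # fine-grain case
coarse = outer_loop_chunks(192, 16)  # coarser-grain case
print(len(fine), len(coarse))        # prints: 48 12
```

Under this assumption the fine-grain run creates 48 tasks and the coarser-grain run 12, which is still a small number of fairly large tasks for a four-processor machine.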

For the coarser granularity case, we observe that both user-level thread imple- 
mentations scale better than kernel-level threads implementation. Particularly, 
they achieve up to 7% higher speedups. In the fine granularity case, these gains
increase, resulting in 18% higher speedups. In both cases, the user-level
thread implementations achieve similar performance. Although the CMM ap- 
plication is parallelized using chunk size 4, its granularity is still very coarse 
because in our implementations we parallelize only the outermost loop.

The second application, BlockLU, divides the matrix into blocks. We use one
task to represent the computation required for each block, so by using different block
sizes we can observe the performance of the application at different granularities.
We execute BlockLU for a matrix of size 1024x1024, using block sizes 32x32 and
16x16 to express the coarse-grain and the fine-grain case, respectively. Speedups
for each case are illustrated in Figure 4.b using the same form as Figure 4.a.
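
To give a feel for how the block size changes the amount of work to schedule, the following sketch counts the blocks of a 1024x1024 matrix for the two block sizes. The function name `block_grid` is hypothetical; how many tasks the BlockLU implementation actually creates per block is not stated in the text, so only the block counts are computed.

```python
def block_grid(matrix_size, block_size):
    """Return (blocks per dimension, total number of blocks) for a square matrix."""
    per_dim = -(-matrix_size // block_size)  # ceiling division
    return per_dim, per_dim * per_dim


coarse = block_grid(1024, 32)  # coarse-grain case
fine = block_grid(1024, 16)    # fine-grain case
print(coarse, fine)            # prints: (32, 1024) (64, 4096)
```

Halving the block edge quadruples the number of blocks, so the fine-grain case has four times as many units of work to manage.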

The speedup measurements for BlockLU exhibit larger divergences than for
CMM, due to its finer granularity. During the execution of BlockLU, for a matrix
of size 1024x1024 and block size 16x16, several thousand tasks are created.
For the coarse-grain case, we observe that the speedups of both runtime library
implementations are up to 24% higher than those of the kernel-level thread
implementation. In the fine-grain case, the kernel-level threads implementation does
not scale for more than three processors. In contrast, both user-level
implementations scale well; the NTLib implementation achieves 40% better speedup
than the FibRT implementation, mainly due to its lower runtime overhead.

Our last application, Raytrace, differs from the other applications due to its
coarser granularity. Although Raytrace creates several tasks, the work
associated with each task is quite large in terms of execution time. Consequently,
the task management overhead can be considered negligible compared
to the total execution time. We execute Raytrace with the car scene as input.
Figure 4.c illustrates the speedups of the Raytrace execution. As we can
see, while all the implementations scale well, the NTLib implementation achieves
approximately 3% better speedup than the other implementations.

Summarizing, the experiments make clear that for the exploitation of the fine- 
grain parallelism we must minimize the runtime overhead. Kernel-level threads 
are inefficient for that purpose due to their high overhead. Runtime libraries 
based on user-level thread packages provide adequate support, and low overhead 
for the efficient exploitation of the fine-grain parallelism. Furthermore, as shown 
by our application experiments the NTLib runtime library is more suitable than 
the corresponding FibRT. 



5 Related Work 

Although the Windows thread API provides the infrastructure for high performance
multithreaded programming, only a few runtime systems have been developed
until recently. The runtime systems presented below concern only the
platform of Windows NT/2000. Visual KAP [6] from Kuck and Associates Inc. is
a commercial optimizing preprocessor for Windows NT/95 that provides a multiprocessing
runtime library. Another system for structured high-performance multithreaded
programming has been presented in [15]. It is based on the Sthreads
library on top of the support that Windows NT provides. In both cases, there 
is no support for multilevel parallelism and wherever there are nested parallel 
loops, only the outermost is parallelized. Furthermore, they are based on Win- 
dows kernel-level threads only, which according to our experiments in Section 4 
are inappropriate for the exploitation of fine-grain parallelism. 

The Illinois-Intel Multithreading Library [4] supports various types of paral- 
lelism and extends the degree of available support for multithreading by provid- 
ing the capability to express nested loop, co-begin/co-end and DAG parallelisms. 
IML does support multiple levels of general, unstructured parallelism. IML uses
Windows NT’s fibers as user-level threads, whereas our NTLib defines
its own threads, which have lower overhead for creation and context
switching and, as a result, can support fine-grain parallelism. However, the
most important difference is that our system provides, according to the defined
programming model, the necessary functionality for multiprogramming support
with arbitrary forms of parallelism.

Although on Unix platforms there is a variety of custom user-level
thread packages, this is not the case for the Windows NT platforms. The only
user-level thread package that we found is that of Cthreads [8]. However,
no information is given about the implementation and technical details of
this port. The development of Cthreads itself continues only to the extent
that it supports ongoing projects.

6 Conclusions and Future Work 

This paper presented the development of two multithreaded runtime libraries,
NTLib and FibRT, based on nanothreads and fibers respectively. The overhead
of both runtime libraries is low enough to make the management of fine-grain 
parallelism affordable. The systematic comparison of the two libraries showed 
that nanothreads are more efficient lightweight user-level threads than fibers. Ex- 
periments with applications and application benchmarks indicate that NTLib 
provides more efficient parallelization and better scalability than FibRT. The 
main objective of our system is the effective integration of fine-grain parallelism 
exploitation and multiprogramming. Our future work is concentrated on the im- 
plementation of a kernel interface that provides a lightweight communication 
path between active user applications and the Windows operating system. This 
interface will support requests of resources from the user-level execution envi- 
ronment and will inform it of actual resource allocation and availability. The 
implementation of the kernel interface relies on shared memory as the commu- 
nication mechanism between the kernel and the application and vice versa. This 
mechanism and the scheduling policies are the major part of a series of modi- 
fications performed to Windows kernel and have already been implemented in 
other operating systems [9,10]. Accordingly, we are in the process of
implementing the necessary mechanism for integrating the kernel interface for
multiprogramming support into our runtime system in the context of Windows
2000.



Abstract. Load balanced parallel radix sort solved the load imbalance
problem present in parallel radix sort. By redistributing the keys in each
round of radix, each processor has exactly the same number of keys,
thereby reducing the overall sorting time. Load balanced radix sort is currently
known as the fastest internal sorting method for distributed-memory
multiprocessors. However, as the computation time is balanced, the com- 
munication time emerges as the bottleneck of the overall sorting perfor- 
mance due to key redistribution. We present in this report a new parallel 
radix sorter that solves the communication problem of balanced radix 
sort, called partitioned parallel radix sort. The new method reduces the 
communication time by eliminating the redistribution steps. The keys 
are first sorted in a top-down fashion (left-to-right as opposed to right- 
to-left) by using some most significant bits. Once the keys are localized 
to each processor, the rest of sorting is confined within each processor, 
hence eliminating the need for global redistribution of keys. It enables 
well balanced communication and computation across processors. The 
proposed method has been implemented on three different distributed-memory
platforms, including IBM SP2, CRAY T3E, and PC Cluster.
Experimental results with various key distributions indicate that partitioned
parallel radix sort indeed shows significant improvements over
balanced radix sort. IBM SP2 shows 13% to 30% improvement, while
Cray/SGI T3E shows 20% to 100% improvement in execution time. PC cluster shows
over 2.5-fold improvement in execution time.



1 Introduction 

Sorting is one of the fundamental problems in computer science. Its use can
be found almost everywhere, be it scientific computation or non-numeric
computation [8,9]. Sorting of a certain number of keys has been used
in benchmarking various parallel computers or judging the specific algorithm
performance when it is experimented on the same parallel machine. Serial sorts
often need O(N log N) time, and the time becomes significant as the number of
keys becomes large. Because of its importance, numerous parallel sorting algorithms
have been developed to reduce the overall sorting time, including bitonic
sort [1], sample sort [4,5], and column sort [7]. In general, parallel sorts consist 
of multiple rounds of serial sort, called local sort, performed in each processor in 
parallel, followed by moving keys among processors, called redistribution step [6]. 
Local sort and data redistribution may be intermixed and iterated a few times 
depending on the algorithms used. The time spent in local sort depends on the 
number of keys. Parallel sort time is the sum of the times of local sort and the 
times for data redistribution in all rounds. To make the sort fast, it is important 
to distribute the keys as evenly as possible throughout the rounds of sort, since 
the execution time is dependent on the most heavily loaded processor in each 
round [5,11]. If a parallel sort has made its work load balanced perfectly in each 
round, there would be no further improvement of the time spent in that part. 
However, the communication time varies depending on the data redistribution 
schemes (e.g. all-to-all, one-to-many, many-to-one), the amount of data and its 
frequency of communication (e.g. many short messages, or a few long messages), 
and network topologies (hypercube, mesh, fat-tree) [8,3]. It was reported that 
for a large number of keys, the communication times occupy a great portion of
the sorting time [3,12]. Load balanced parallel radix sort [11], or LBR, reduces the
execution time by perfectly balancing the load among processors in every round.
Partitioned parallel radix sort, or PPR, proposed in this paper, further improves
the performance by reducing the multiple rounds of data redistribution to one.
While partitioned radix sort may introduce slight load imbalance among pro- 
cessors due to its not-so-perfect key distribution, the overall performance gain 
can be of particular significance since it substantially reduces the overall com- 
munication time. It is precisely the purpose of this report to introduce this new 
algorithm that features balanced computation and balanced communication. 
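
The cost model described in this introduction (per round, the slowest processor's local sort plus the redistribution time, summed over all rounds) can be sketched as follows. The function name and all timing numbers are hypothetical and chosen only for illustration; they are not measurements from this paper.

```python
def parallel_sort_time(local_times, redistribution_times):
    """Total time = sum over rounds of (slowest local sort + redistribution).

    local_times[r][p] is the local sort time of processor p in round r;
    redistribution_times[r] is the data redistribution time of round r.
    """
    total = 0.0
    for per_processor, redistribution in zip(local_times, redistribution_times):
        total += max(per_processor) + redistribution
    return total


# Hypothetical numbers: perfectly balanced load, redistribution in every round.
balanced = parallel_sort_time([[1.0, 1.0, 1.0, 1.0]] * 4, [2.0, 2.0, 2.0, 2.0])
# Hypothetical numbers: slight imbalance, but only one redistribution step.
single = parallel_sort_time([[1.1, 0.9, 1.0, 1.0]] * 4, [2.0, 0.0, 0.0, 0.0])
print(balanced, round(single, 2))
```

With these made-up inputs the slightly imbalanced schedule with a single redistribution comes out well ahead, which is the trade-off the paragraph above argues for.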

The paper is organized as follows. Section 2 briefly explains balanced parallel 
radix sort and identifies its deficiency in terms of communication. Section 3 
presents a new partitioned parallel radix sort. Section 4 lists the experimental 
results of the algorithm on three different distributed-memory parallel machines 
including SP2, T3E and PC cluster. The last section concludes this report. 

2 Parallel Radix Sort 

Radix sort is a simple yet very efficient sorting method that outperforms many 
well known comparison-based algorithms. Suppose N keys are evenly distributed
to P processors initially such that there are $n = N/P$ keys per processor. When
sort completes, we expect that all keys are ordered according to the rank of
processors $P_0, P_1, \ldots, P_{P-1}$, and that keys in each processor have also been sorted.
Serial radix sort is implemented in two different ways: radix exchange sort and 
straight radix sort [10]. Since parallel radix sorts are typically derived from a 
serial radix sort, we first briefly describe them, and ideas of parallelization will 
be given. We define some symbols used later in this paper as listed below:

- b is the number of bits of an integer key, such that an integer key is represented
as $(i_{b-1} i_{b-2} \cdots i_1 i_0)$.

- g is the number of bits in the group used at each round of scanning.

- r is the number of rounds each key goes through, also the number of rounds
in radix sort.

Radix exchange sort generates and maintains an ordered queue. Initially, it
reads the g least significant bits (in other words, $i_{g-1} i_{g-2} \cdots i_1 i_0$) of each key,
and stores it in a new queue at the location determined by the g bits. When all
keys are examined and placed in the new queue, the round completes. Keys
are ordered according to their least significant g bits. The following rounds use
the next g least significant bits $(i_{2g-1} i_{2g-2} \cdots i_{g+1} i_g)$ to order them in a new
queue. Keys move back and forth during the round. In the following rounds, the
same operations are done as before. After r rounds (where $r = \lceil b/g \rceil$), all bits are
scanned, and the sort is complete. LBR parallelizes the sort by repeating the
following process a given number of times: it builds an ordered queue globally
by using the least-significant g bits of keys, then divides it into P equal sized
segments, and allocates them to all processors. In other words, a globally ordered
queue is created, then divided equally, and each segment is assigned to one processor. In
the next round a new ordered queue is again generated using the next g bits
by P processors together, then divided equally, and each segment is distributed. Load
of processors is always perfectly balanced in this scheme because the same
number of keys is given to each processor. LBR is reported to outperform the fastest
parallel sorts by up to 100% in execution time [11]. LBR, however, requires data
redistribution across processors in every round, thus, it consumes a considerable
amount of time in communication.
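
The round structure described above can be sketched serially as follows. This is a minimal illustration, not the authors' C/MPI code: the function names are hypothetical, keys are assumed to be non-negative integers below 2^b, and `split_equally` only mimics the equal-sized segmentation that LBR applies to the globally ordered queue (assuming P divides the number of keys).

```python
def lsd_radix_sort(keys, b=32, g=8):
    """Sort non-negative b-bit integers, scanning g bits per round, LSB first."""
    rounds = -(-b // g)              # r = ceil(b / g)
    mask = (1 << g) - 1
    queue = list(keys)
    for rnd in range(rounds):
        shift = rnd * g
        # New ordered queue: one slot per value of the current g-bit group.
        new_queue = [[] for _ in range(1 << g)]
        for key in queue:            # appending keeps earlier rounds' order (stable)
            new_queue[(key >> shift) & mask].append(key)
        queue = [key for slot in new_queue for key in slot]
    return queue


def split_equally(queue, P):
    """Cut an ordered queue into P equal segments, one per processor."""
    n = len(queue) // P              # assumes P divides len(queue)
    return [queue[p * n:(p + 1) * n] for p in range(P)]


ordered = lsd_radix_sort([170, 45, 75, 90, 802, 24, 2, 66], b=16, g=4)
segments = split_equally(ordered, 4)
```

In LBR the segmentation step is repeated after every round, which is where the per-round redistribution cost comes from.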

Straight radix sort initially uses $M = 2^g$ buckets instead of the ordered
queues. It first bucket-sorts [10] keys using the g most significant bits
$(i_{b-1} i_{b-2} \cdots i_{b-g})$ of each key. Bucket-sort puts keys into buckets whose index
corresponds to the g bits. Thus, keys with the same g bits gather in the same
bucket. Similarly in the second round, keys in each bucket are bucket-sorted
again using the next g most significant bits $(i_{b-(g+1)} i_{b-(g+2)} \cdots i_{b-2g})$, generating
M new subbuckets per bucket. The remaining rounds are done in the same
manner. In this scheme keys never leave the bucket where they have been placed
in a previous round. One significant problem in the scheme is that the number of
overall buckets (subbuckets) explodes quickly, and many buckets with few keys
waste a lot of resource (memory) if not carefully implemented.
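
A minimal recursive sketch of this most-significant-bits-first bucket scheme is given below. The function name and the `top` parameter (the number of bits not yet examined) are hypothetical, and keys are assumed to be non-negative integers below 2^b.

```python
def msd_bucket_sort(keys, b=16, g=4, top=None):
    """Bucket-sort by the next g most significant bits, then recurse per bucket."""
    if top is None:
        top = b                      # no bits examined yet
    if top <= 0 or len(keys) <= 1:
        return list(keys)
    width = min(g, top)              # the last round may have fewer than g bits
    shift = top - width
    mask = (1 << width) - 1
    buckets = [[] for _ in range(1 << width)]
    for key in keys:
        buckets[(key >> shift) & mask].append(key)
    ordered = []
    for bucket in buckets:           # keys never leave their bucket
        ordered.extend(msd_bucket_sort(bucket, b, g, shift))
    return ordered


data = [0xBEEF, 0x0001, 0xABCD, 0x00FF, 0xABCD, 0x1234]
result = msd_bucket_sort(data)
```

Every bucket holding more than one key allocates a fresh set of subbuckets in the next round, which illustrates why the bucket count can grow quickly in a naive implementation.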

In the parallel implementation, the first round is done exactly the same as the serial
straight radix sort. Then, each processor is assigned and will be in charge
of a few consecutive buckets obtained in the first round. Now, buckets are exchanged
among processors according to their index, thus keys with the same g
most significant bits are collected from all processors into one. In the remaining
$r - 1$ rounds, bucket sort continues locally by using the remaining $b - g$ bits without data
exchange. If keys are evenly distributed among buckets, each processor will hold
M/P buckets on average. However, it is possible that some processors may be
allocated buckets with a lot of keys while others have few, depending on
the distribution characteristics of the keys. This static/naive partitioning of keys
may cause severe load imbalance among processors. PPR solves this problem as
described in the next section.



3 Partitioned Parallel Radix Sort 

Assume that we use only $M = 2^g$ buckets per processor throughout the sort.
$B_{ij}$ represents bucket j in processor $P_i$. PPR consists of two phases: key partitioning
and local sort. PPR needs $r = \lceil b/g \rceil$ rounds in all. Details are as follows:

I. Key Partitioning 

Each processor bucket-sorts its keys using the g most significant bits. From
now on, the left-most g bits of a key are referred to as the most significant
digit (MSD). Each key is placed into an appropriate bucket; thus, processor
$P_k$ stores a key in bucket $B_{kj}$, where j corresponds to the MSD of the
key. At the end of the bucket sort, all keys have been partially sorted internally
with respect to their MSDs, i.e. the first bucket includes the smallest
keys, the second the next smallest, and so on, and the last the largest. Then, processors
collect the key counts of their buckets together to find a global key
distribution map as follows. An illustration is given in Figure 1.

For all $j = 0, 1, \ldots, M-1$, key counts of $B_{kj}$ are added up to get $G_j$, a
global count of keys in all buckets $B_{kj}$ across processors ($k = 0, 1, \ldots, P-1$).
Then prefix sums of the global key counts $G_j$ are computed.
Let’s consider hypothetical buckets (called global buckets) $GB_j$, each of which is the
collection of the jth buckets $B_{kj}$ from all processors $P_0, P_1, \ldots, P_{P-1}$. Then $G_j$
corresponds to the key count of bucket $GB_j$. Taking into account the prefix
sums and the average number of keys ($n = N/P$), global buckets are to be
divided into P groups, each having one or more consecutive buckets, in such
a way that the key counts of each group become as equal as possible. The
first group consists of the first few buckets $GB_0, GB_1, \ldots, GB_{k-1}$ whose counts
add up to approximately n keys, the second of $GB_k, GB_{k+1}, \ldots, GB_l$, again
having approximately n keys, etc. The jth group of buckets is now allocated to $P_j$,
which becomes the owner of the buckets.

Now all processors send their buckets of keys to their owners simultaneously,
except those whose owner is itself. After this movement, keys are sorted
partially across processors, since any key in $GB_i$ is smaller than any key in
$GB_j$ for $i < j$. Note that keys have not been sorted locally yet.
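
The counting and partitioning steps of Phase I can be sketched as below. The text does not spell out the exact rule for cutting the global buckets into P groups, so the midpoint rule used here (a bucket goes to the processor whose share of N/P keys contains the bucket's midpoint in the prefix-sum order) is an assumption, as are the function names. Keys are assumed to be non-negative integers below 2^b.

```python
def msd_counts(keys, b, g):
    """Count one processor's keys per bucket, indexed by their g-bit MSD."""
    counts = [0] * (1 << g)
    for key in keys:
        counts[key >> (b - g)] += 1
    return counts


def partition_buckets(local_counts, P):
    """local_counts[k][j] is the key count of bucket B_kj on processor P_k.

    Returns (G, owner, loads): global counts G_j, the owning processor of
    each global bucket GB_j, and the resulting key count per processor.
    """
    M = len(local_counts[0])
    G = [sum(counts[j] for counts in local_counts) for j in range(M)]
    N = sum(G)
    owner, loads, seen = [], [0] * P, 0
    for j in range(M):
        # Assumed midpoint rule: (seen + G[j] / 2) / (N / P), in integer arithmetic.
        p = min(P - 1, (2 * seen + G[j]) * P // (2 * N)) if N else 0
        owner.append(p)
        loads[p] += G[j]
        seen += G[j]                 # running prefix sum of the G_j
    return G, owner, loads


# Two processors, 8-bit keys, g = 2 (four buckets per processor).
p0 = [0x00, 0x10, 0x20, 0x41, 0x80, 0x81, 0xC0, 0xC5]
p1 = [0x05, 0x40, 0x50, 0x7F, 0x90, 0xA0, 0xD0, 0xFF]
counts = [msd_counts(p0, 8, 2), msd_counts(p1, 8, 2)]
G, owner, loads = partition_buckets(counts, P=2)
```

Because owners are assigned in increasing bucket order, each processor receives a run of consecutive global buckets, which is what keeps the keys partially sorted across processors after the exchange.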

II. Local Sort 

Keys in each processor are now sorted locally by all processors at the same time,
to put all N keys in order. Serial radix exchange sort is performed at first
with the rightmost g bits, then with the next rightmost g bits, and so on, until all
$b - g$ bits are used up. Only $b - g$ bits are examined because the leftmost g
bits have already been used in Phase I. Phase II needs $\lceil (b-g)/g \rceil$ rounds.

Fig. 1. Local and global key count maps and bucket partitioning: (a) bucket counts
in individual processors; (b) global bucket count and the partitioning map
(axes: key count versus bucket index)
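
Phase II can be sketched as follows for a single processor. One detail is an assumption of this sketch rather than something the text states: each owned global bucket is sorted separately, so that examining only the low b - g bits is enough (all keys in one bucket share the same MSD). The function name is hypothetical.

```python
def phase2_local_sort(owned_buckets, b=16, g=4):
    """Sort the keys of one processor after Phase I.

    owned_buckets: one key list per owned global bucket, in bucket-index
    order; all keys in a list share the same g-bit MSD.
    Returns (sorted keys, number of rounds used per bucket).
    """
    rounds = -(-(b - g) // g)        # ceil((b - g) / g)
    mask = (1 << g) - 1
    result = []
    for bucket in owned_buckets:
        queue = list(bucket)
        for rnd in range(rounds):    # rightmost g bits first
            slots = [[] for _ in range(1 << g)]
            for key in queue:
                slots[(key >> (rnd * g)) & mask].append(key)
            queue = [key for slot in slots for key in slot]
        result.extend(queue)
    return result, rounds


owned = [[0x1F00, 0x1001, 0x10FF], [0x2ABC, 0x2000]]
local_sorted, rounds_used = phase2_local_sort(owned, b=16, g=4)
```

With b = 16 and g = 4 the local sort takes three rounds, one fewer than a full radix sort of the same keys.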



The performance of PPR relies on how evenly the keys are distributed in the first 
phase. It is not very likely that each processor gets exactly the same number of 
keys. Refinement of the partitioning of keys can be made in Phase I by further 
dividing the buckets that lie in the partition boundary and have more counts 
than needed for even partitioning. However, splitting a bucket and allocating to 
two neighboring processors could not produce the desired sorted output by Phase 
II, since keys having the same MSD would stay in different processors. Thus, we 
avoid splitting buckets further, and keys should be distributed to processors 
bucket by bucket, on which our refinement method is based, as explained below. 

PPR resembles sample sort [4,5] in the aspect of data partitioning and local 
sort. In sample sort, after keys have moved according to the splitters (pivots) to 
each processor, they are partially ordered across processors, thus further movement
of keys across processors is not needed. One significant difference is that
the global key distribution statistic in sample sort is not known until keys actually
move to designated processors, while in PPR it is known before the costly
data movement. Thus, it is possible to adjust the partitioning before the actual
key movement. If the current partitioning is not likely to give a satisfactory balance
in work load, PPR increases g so that the keys in the boundary buckets can
spread out further into a larger number of sub-buckets to produce a more even
partition. For example, if g is increased by 2 (bits), the keys in each boundary
bucket are split into four buckets, enabling finer partitioning. The process
repeats until a satisfactory partitioning is obtained.
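
The refinement loop can be sketched as follows, with two simplifications that are assumptions of this sketch: all keys are re-bucketed with the larger g (the text only spreads out the boundary buckets), and "satisfactory" is taken to mean that the heaviest processor holds at most (1 + tolerance) times N/P keys. The function names, the `tolerance` and `step` parameters, and the midpoint ownership rule are hypothetical.

```python
def max_load_for_g(all_keys, P, b, g):
    """Heaviest per-processor key count when partitioning on a g-bit MSD."""
    G = [0] * (1 << g)
    for key in all_keys:
        G[key >> (b - g)] += 1
    N = len(all_keys)
    loads, seen = [0] * P, 0
    for count in G:
        # Assumed midpoint rule for choosing the owner of each global bucket.
        p = min(P - 1, (2 * seen + count) * P // (2 * N))
        loads[p] += count
        seen += count
    return max(loads)


def choose_g(all_keys, P, b, g, tolerance=0.1, step=2):
    """Increase g until the partition is balanced enough or bits run out."""
    target = len(all_keys) / P
    while g + step <= b and max_load_for_g(all_keys, P, b, g) > (1 + tolerance) * target:
        g += step
    return g


# Eight 8-bit keys, six of which share the 2-bit MSD 01.
keys = [0x00, 0x40, 0x48, 0x50, 0x60, 0x68, 0x70, 0xC0]
before = max_load_for_g(keys, P=2, b=8, g=2)
chosen = choose_g(keys, P=2, b=8, g=2)
after = max_load_for_g(keys, P=2, b=8, g=chosen)
```

In this small example the 2-bit partition puts 7 of the 8 keys on one processor, while moving to 4 bits allows an even 4/4 split, all decided from counts alone before any key is moved.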

4 Experiments and Discussion 

PPR has been evaluated on three different parallel machines: IBM SP2, Cray T3E,
and PC cluster. PC cluster is a set of 16 personal computers with 300MHz
Pentium-II CPUs interconnected by a 100Mbps fast ethernet switch. T3E is the
fastest machine among them, as far as the computational speed is concerned.
As inputs of sort, N/P keys are synthetically generated in each processor with
the distribution characteristics called uniform, gauss, and stagger [11]. Uniform
creates keys with uniform distribution. Gauss forms keys with Gaussian distribution.
Stagger produces specially distributed keys as described in [4]. We run
the programs on up to 64 processors, each with a maximum of 64M keys. Keys
are 32-bit integers for SP2 and PC cluster, and 64-bit integers for T3E. Code is
written in C with the MPI communication library [13]. Among many experiments
we have performed, only a few representative results are shown here.

We first verify that PPR can reduce the communication time while it tolerates
load imbalance. We expect the communication time to be cut down to 1/4 and 1/8
at maximum with g = 4,8, compared to LBR for sorts of 32-bit and 64-bit
integer keys, respectively. As seen in Figures 2-3, there is a great reduction in
communication times: they are now about 1/4 for 32-bit keys in SP2, and around
1/6 for 64-bit integers in T3E.
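
The expected ratios follow from counting redistribution rounds: LBR redistributes keys in each of its $r = \lceil b/g \rceil$ rounds, while PPR redistributes once. The sketch below assumes g = 8 bits per round, which is the value that reproduces the 1/4 and 1/8 figures under this formula, and it assumes every redistribution round costs about the same. The function name is hypothetical.

```python
def redistribution_rounds(b, g):
    """Return (LBR rounds, PPR rounds) of key redistribution for b-bit keys."""
    r = -(-b // g)                   # r = ceil(b / g)
    return r, 1


for b in (32, 64):
    lbr, ppr = redistribution_rounds(b, 8)
    print(f"{b}-bit keys: LBR {lbr} rounds, PPR {ppr} round, ratio {ppr}/{lbr}")
```

The measured 1/6 on T3E falls short of the ideal 1/8, which is consistent with this being a best-case bound.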

The load imbalance among processors is shown in Figure 4. It is the greatest
for the case of Gauss, with a maximum difference of 5.2% from the perfectly
balanced case, which indicates that it is not so severe as to significantly impair the
overall performance of PPR. Improved performance of PPR over LBR can be
observed in Figures 5 & 6 for SP2, and Figures 7 & 8 for T3E.

We have found that in T3E the communication portion of the sorting time is
greater than in SP2. In addition, since the keys are 64-bit integers in T3E, more
improvement of PPR over LBR is expected due to larger r because we save
$r - 1$ rounds of interprocessor communication. More enhancement on T3E can
be observed in Figures 7 & 8 compared to Figures 5 & 6, respectively. In PC
cluster, the network is so slow that the two parallel sorts are slower than the
uniprocessor sort for the cases of P > 8 as shown in Figures 9-10. Nevertheless,
PPR delivers remarkable performance over LBR since the communication time
dominates the computation time. Table 1 lists the performance.

Fig. 2. Comparison of communication times of PPR and LBR on SP2 with
gaussian distribution (horizontal axis: number of processors, 4, 8, 16)

Fig. 3. Comparison of communication times on T3E with uniform distribution
(panel title: Uniform (T3E); horizontal axis: number of processors)

Fig. 4. Percentage deviation of work load from perfect
gaussian distribution (panel title: Gaussian (SP2))

Fig. 5. Execution times on SP2 with uniform distribution

Fig. 6. Execution times on SP2 with gaussian distribution

Fig. 7. Execution times on T3E with uniform distribution

Fig. 8. Execution times on T3E with gaussian distribution

Fig. 9. Execution times on PC cluster with uniform distribution

Fig. 10. Execution times on PC cluster with stagger distribution



5 Conclusion 

This paper has proposed the partitioned parallel radix sort, which reduces the 
communication bottleneck of balanced radix sort. The main idea is to distribute 
the keys so that they are ordered across processors but not yet within each 
processor. Once the keys have been localized, a serial radix sort is applied on 
each processor to sort its assigned keys. The method thus improves overall 
performance by removing a significant portion of the communication time. 
Experimental results on three distributed-memory machines indicate that 
partitioned parallel radix sort always performs better than the previous scheme, 
regardless of data size, number of processors, and key initialization scheme. 
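
The two phases summarized above can be sketched in a few lines of Python. This is only an illustrative, single-process simulation: the function names, the use of a global histogram over the leading `part_bits` bits, and the rule that assigns contiguous bucket ranges to processors are assumptions made for the sketch, not the exact partitioning procedure of the paper.

```python
import random

def local_radix_sort(keys, key_bits=32, radix_bits=8):
    """Serial LSD radix sort for non-negative integers below 2**key_bits."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for bucket in buckets for k in bucket]
    return keys

def partitioned_radix_sort(per_proc_keys, key_bits=32, part_bits=8):
    """Simulate p processors: one key exchange, then purely local sorting."""
    p = len(per_proc_keys)
    shift = key_bits - part_bits
    # Global histogram of the most significant part_bits bits.
    hist = [0] * (1 << part_bits)
    for keys in per_proc_keys:
        for k in keys:
            hist[k >> shift] += 1
    # Assumed balancing rule: give each processor a contiguous range of
    # buckets holding roughly total/p keys.
    target = sum(hist) / p
    owner, proc, seen = [], 0, 0
    for count in hist:
        owner.append(proc)
        seen += count
        if proc < p - 1 and seen >= target * (proc + 1):
            proc += 1
    # Single all-to-all exchange: every key moves to the owner of its bucket.
    received = [[] for _ in range(p)]
    for keys in per_proc_keys:
        for k in keys:
            received[owner[k >> shift]].append(k)
    # Keys are now ordered across processors; sort within each one.
    return [local_radix_sort(r, key_bits) for r in received]
```

Concatenating the returned lists in processor order gives the globally sorted sequence, which is the property the conclusion relies on.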
Abstract. A practical three-dimensional shape optimization for 
aerodynamic design of a transonic wing has been performed using 
Evolutionary Algorithms (EAs). Because EAs coupled with 
aerodynamic function evaluations require enormous computational 
time, Numerical Wind Tunnel (NWT) located at National Aerospace 
Laboratory in Japan has been utilized based on the simple master-slave 
concept. Parallel processing makes EAs a very promising approach for 
practical aerodynamic design. 



1 Introduction 

Most commercial aircraft today cruise at transonic speeds. During the long 
duration of cruise, engine thrust is applied to maintain aircraft speed against 
aerodynamic drag. Since a large part of their maximum takeoff weight is occupied 
by the fuel weight, the objective of an aerodynamic design optimization of a transonic 
wing is, in principle, minimization of drag. 

Unfortunately, drag minimization has many tradeoffs. There is a tradeoff between 
drag and lift because one of the drag components, called induced drag, increases in 
proportion to the square of the lift. A wing that achieves no induced drag would have 
no lift. Another tradeoff lies between aerodynamic drag and wing structure weight. 
An increase in the wing thickness allows the same bending moment to be carried with 
reduced skin thickness, with an accompanying reduction in weight. On the other hand, 
it will lead to an increase in another component of the drag called wave drag. 
Therefore, the aerodynamic design of a transonic wing is a challenging problem. 

Furthermore, optimization of a transonic wing design is difficult for the 
following reasons. First, the aerodynamic performance of a wing is very sensitive to its 
shape. A very precise definition of the shape is needed, and thus its definition usually 
requires more than 100 design variables. Second, function evaluations are very expensive. 
An aerodynamic evaluation using a high-fidelity model such as the Navier-Stokes 
equations usually requires 60-90 minutes of CPU time on a vector computer. 

M. Valero et al. (Eds.): ISHPC 2000, LNCS 1940, pp. 172-181, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 
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Among optimization algorithms, Gradient-based Methods (GMs) are well-known 
algorithms, which probe the optimum by calculating local gradient information. 
Although GMs are generally superior to other optimization algorithms in efficiency, 
the optimum obtained from these methods may not be a global one, especially in the 
aerodynamic optimization problem. 

On the other hand, Evolutionary Algorithms, in particular Genetic Algorithms 
(GAs), are known to be robust methods modeled on the mechanism of natural 
evolution. GAs are capable of finding a global optimum because they do not use 
any derivative information and they search from multiple design points. Therefore, 
GAs are a promising approach to aerodynamic optimizations. 

Finding a global optimum in the continuous domain is, however, challenging for 
GAs. In traditional GAs, binary representation has been used for chromosomes, which 
evenly discretizes a real design space. Since binary substrings representing each 
parameter with a desired precision are concatenated to form a chromosome for GAs, 
a chromosome encoding a large number of design variables for real-world 
problems becomes too long. In addition, there is a discrepancy 
between the binary representation space and the actual problem space. For example, 
two points close to each other in the real space might be far away in the binary- 
represented space. It is still an open question how to construct an efficient crossover 
operator that suits such a modified problem space. 
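
The discrepancy between the two spaces can be illustrated with a small example. The sketch below assumes a plain 8-bit binary code on [0, 1]; the helper names `encode`, `decode` and `hamming` are illustrative only. Two neighbouring levels of the real variable differ in every bit of their binary codes.

```python
def encode(value, lo, hi, bits):
    """Map a real value in [lo, hi] to the nearest of 2**bits evenly spaced levels."""
    level = round((value - lo) / (hi - lo) * (2 ** bits - 1))
    return format(level, f"0{bits}b")

def decode(bitstring, lo, hi):
    """Inverse map: binary string back to a real value in [lo, hi]."""
    return lo + int(bitstring, 2) / (2 ** len(bitstring) - 1) * (hi - lo)

def hamming(a, b):
    """Number of bit positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

a = encode(127 / 255, 0.0, 1.0, 8)  # level 127
b = encode(128 / 255, 0.0, 1.0, 8)  # level 128, the adjacent level
print(a, b, hamming(a, b), decode(b, 0.0, 1.0) - decode(a, 0.0, 1.0))
```

The two decoded values differ by only 1/255, yet a crossover or mutation operator would have to change all eight bits to move from one to the other.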

A simple solution to these problems is the use of floating-point representation of 
parameters as a chromosome [1]. In these real-coded GAs, a chromosome is coded as 
a finite-length string of the real numbers corresponding to the design variables. The 
floating-point representation is robust, accurate, and efficient because it is 
conceptually closest to the real design space, and moreover, the string length reduces 
to the number of design variables. It has been reported that the real-coded GAs 
outperformed binary-coded GAs in many design problems [2]. However, even the 
real-coded GAs would lead to premature convergence when applied to aerodynamic 
shape designs with a large number of design variables. 

To apply GAs to practical, large-scale engineering problems, the idea of dynamic 
coding, in particular Adaptive Range GAs [3,4], is combined with the use of the 
floating-point representation. The objective of the present work is to apply the 
resulting approach to a practical transonic wing design and to demonstrate its 
feasibility. 



2 Adaptive Range Genetic Algorithms 

To treat a large search space with GAs more efficiently, sophisticated approaches 
have been proposed, referred to as dynamic coding, which dynamically alters the 
coarseness of the search space. In [5], Krishnakumar et al. presented Stochastic 
Genetic Algorithms (Stochastic GAs) to solve problems with a large number of real 
design parameters efficiently. Stochastic GAs have been successfully applied to 
Integrated Flight Propulsion Controller designs [5] and air combat tactics 
optimization [6]. As they mentioned, the Stochastic GAs bridge the gap between ES 
and GAs to handle large design problems. 
Adaptive Range Genetic Algorithms (ARGAs) proposed by Arakawa and 
Hagiwara [3] are a fairly new approach, also using dynamic coding for binary-coded 
GAs to treat continuous design space. The essence of their idea is to adapt the 
population toward promising regions during the optimization process, which enables 
efficient and robust search with good precision while keeping the string length small. 
Moreover, ARGAs eliminate the need for prior definition of search boundaries since 
ARGAs distribute solution candidates according to the normal distributions of the 
design variables in the present population. In [4], ARGAs were applied to 
pressure vessel designs and outperformed other optimization algorithms. 

Since the ideas of the Stochastic GAs and the use of the floating point 
representation are incompatible, ARGAs for floating point representation are 
developed. The real-coded ARGAs are expected to possess both advantages of the 
binary-coded ARGAs and the floating point representation to overcome the problems 
of having a large search space that requires continuous sampling. 



2.1 ARGAs for Binary Representation 

When conventional binary-coded GAs are applied to real-number optimization 
problems, discrete values of real design variables $p_i$ are given by evenly discretizing 
prior-defined search regions for each design variable $[p_{i,\min}, p_{i,\max}]$ according to the 
length of the binary substring $b_{i,l}$ as 

$$p_i = (p_{i,\max} - p_{i,\min}) \frac{c_i}{2^{sl} - 1} + p_{i,\min} \qquad (1)$$

where $sl$ represents string length and $c_i = \sum_{l=1}^{sl} b_{i,l} \cdot 2^{l-1}$. 

In binary-coded ARGAs, decoding rules for the offspring are given by the 
following normal distributions, 

$$N'(\mu_i, \sigma_i^2)(p_i) = \sqrt{2\pi}\,\sigma_i \cdot N(\mu_i, \sigma_i^2)(p_i) = \exp\left(-\frac{(p_i - \mu_i)^2}{2\sigma_i^2}\right) \qquad (2)$$

where the average $\mu_i$ and the standard deviation $\sigma_i$ of each design variable are 
determined by the population statistics. Those values are recomputed in every 
generation. Then, mapping from a binary string into a real number is given so that the 
region between $N'_{UB}$ and $N'_{LB}$ in Fig. 1 is divided into equal size regions according to 
the binary bit size as 

$$p_i = \begin{cases} \mu_i - \sqrt{-2\sigma_i^2 \cdot \ln\left(N'_{LB} + (N'_{UB} - N'_{LB}) \dfrac{c_i}{2^{sl-1} - 1}\right)} & \text{for } c_i \le 2^{sl-1} - 1 \\ \mu_i + \sqrt{-2\sigma_i^2 \cdot \ln\left(N'_{UB} - (N'_{UB} - N'_{LB}) \dfrac{c_i - 2^{sl-1}}{2^{sl-1} - 1}\right)} & \text{for } c_i \ge 2^{sl-1} \end{cases} \qquad (3)$$

Fig. 1 Decoding for original ARGAs 
where $N'_{UB}$ and $N'_{LB}$ are additional system parameters defined in [0,1]. In the ARGAs, 
genes of design candidates represent relative locations in the updated range of the 
design space. Therefore, the offspring are expected to lie in a range that is likely to 
contain the optimal values of the design variables. 
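
As a rough illustration of this decoding, the sketch below assumes that the integer value `c` of a gene is mapped to a level between the two system parameters and that the level is then inverted through the scaled normal distribution exp(-(p - mu)^2 / (2 sigma^2)). The function name, the argument names and the exact way the levels are spaced are assumptions made for the sketch, so it should be read as one plausible reading of the rule rather than as the reference implementation.

```python
import math

def decode_binary_arga(c, sl, mu, sigma, n_lb, n_ub):
    """Decode integer gene value c (0 .. 2**sl - 1) into a real design variable.

    Assumes 0 < n_lb < n_ub <= 1 and sl >= 2. The lower half of the codes maps
    to the left of the mean, the upper half to the right of it.
    """
    half = 2 ** (sl - 1)
    if c <= half - 1:
        level = n_lb + (n_ub - n_lb) * c / (half - 1)
        return mu - sigma * math.sqrt(-2.0 * math.log(level))
    level = n_ub - (n_ub - n_lb) * (c - half) / (half - 1)
    return mu + sigma * math.sqrt(-2.0 * math.log(level))

# Decoded values for a 4-bit gene around mu = 0 with sigma = 1.
values = [decode_binary_arga(c, 4, 0.0, 1.0, 0.1, 0.9) for c in range(16)]
```

Under these assumptions the decoded values increase with `c`, are symmetric about the mean, and never reach the mean itself, which is consistent with the remark in the text that follows about the resolution near the center of the distribution.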

Although the original ARGAs have been successfully applied to real parameter 
optimizations, there is still room for improvement. The first issue is how to select the 
system parameters $N'_{UB}$ and $N'_{LB}$, on which the robustness and efficiency of ARGAs 
largely depend. The second is the use of constant intervals even near the center of 
the normal distributions. The last is that, since genes represent relative locations, 
the offspring are constantly displaced from the centers of the normal distributions 
when the distributions are updated. Therefore, the actual population statistics do not 
coincide with the updated population statistics. 



2.2 ARGAs for Floating-Point Representation 



In real-coded GAs, real values of design variables are directly encoded as a real 
string $r_i$, $p_i = r_i$ where $p_{i,\min} < r_i < p_{i,\max}$. Otherwise, sometimes normalized 
values of the design variables are used as 

$$p_i = (p_{i,\max} - p_{i,\min}) \cdot r_i + p_{i,\min} \qquad (4)$$

where $0 < r_i < 1$. 

To employ floating-point representation for ARGAs, the real values of design 
variables $p_i$ are rewritten here by the real numbers $r_i$ defined in (0,1) so that the 
integral of the probability distribution of the normal distribution from $-\infty$ to 
$pn_i$ is equal to $r_i$ as 

$$p_i = \sigma_i \cdot pn_i + \mu_i \qquad (5)$$

$$r_i = \int_{-\infty}^{pn_i} N(0,1)(z)\,dz \qquad (6)$$

where the average $\mu_i$ and the standard deviation $\sigma_i$ of each design variable are 
calculated by sampling the top half of the previous population so that the present 
population is distributed in the promising search regions. A schematic view of this 
coding is illustrated in Fig. 2. It should be noted that the real-coded ARGAs resolve 
the drawbacks of the original ARGAs: there is no need to select $N'_{UB}$ and $N'_{LB}$, and 
the resolution near the average is arbitrary. Updating $\mu_i$ and $\sigma_i$ 
every generation, however, results in inconsistency between the actual and updated 
population statistics in the next generation because the selection operator picks up the 
genes that correspond to the promising region according to the old population 




Fig. 3 Flowchart of ARGA 
statistics. To prevent this inconsistency, the present ARGAs update $\mu$ and $\sigma$ every M 
(M>1) generations and then the population is reinitialized. The flowchart of the present 
ARGA is shown in Fig. 3. 
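
The real-coded decoding of Eqs. (5) and (6) amounts to evaluating the standard normal quantile of the gene and rescaling it. A minimal sketch using Python's standard library is given below; the function names, the use of the population (rather than sample) standard deviation, and the convention that larger fitness is better are assumptions of the sketch.

```python
from statistics import NormalDist, mean, pstdev

def decode_gene(r, mu, sigma):
    """Eqs. (5)-(6): p = sigma * pn + mu, where pn is the N(0,1) quantile of r."""
    pn = NormalDist().inv_cdf(r)  # r must lie strictly inside (0, 1)
    return sigma * pn + mu

def encode_value(p, mu, sigma):
    """Inverse map: the gene r is the N(0,1) cumulative probability of pn."""
    return NormalDist().cdf((p - mu) / sigma)

def sample_statistics(values, fitnesses):
    """Mean and standard deviation of one design variable over the top half
    of the population (assumption: larger fitness is better)."""
    ranked = [v for _, v in sorted(zip(fitnesses, values), reverse=True)]
    top = ranked[: max(1, len(ranked) // 2)]
    return mean(top), pstdev(top)
```

A gene of 0.5 always decodes to the current average, and genes close to 0 or 1 decode to values several standard deviations away, so the resolution is finest where the population statistics suggest the optimum lies.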

To improve robustness of the present ARGAs further, relaxation factors $\omega_\mu$ and 
$\omega_\sigma$ are introduced to update the average and standard deviation as 

$$\mu_{new} = \mu_{present} + \omega_\mu (\mu_{sampling} - \mu_{present}) \qquad (7)$$

$$\sigma_{new} = \sigma_{present} + \omega_\sigma (\sigma_{sampling} - \sigma_{present}) \qquad (8)$$

where $\mu_{sampling}$ and $\sigma_{sampling}$ are determined by sampling the top half of the population. 
Here, $\omega_\mu$, $\omega_\sigma$ and M are set to 1, 0.5 and 4, respectively. They are determined by 
parametric studies using some simple test functions. 
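
The relaxed update can be written as a one-line interpolation between the present and the sampled statistics. The sketch below assumes the relaxation form "present value plus factor times the difference" and uses the settings quoted above as defaults; the function names are illustrative.

```python
def relax(present, sampling, omega):
    """Move from the present value toward the sampled one by a fraction omega."""
    return present + omega * (sampling - present)

def update_statistics(mu, sigma, mu_sampling, sigma_sampling,
                      omega_mu=1.0, omega_sigma=0.5):
    """Relaxed update of the average and the standard deviation."""
    return (relax(mu, mu_sampling, omega_mu),
            relax(sigma, sigma_sampling, omega_sigma))

def is_update_generation(generation, m=4):
    """Statistics are refreshed only every m generations (m > 1)."""
    return generation > 0 and generation % m == 0
```

With a factor of 1 the average jumps directly to the sampled value, whereas a factor of 0.5 lets the standard deviation shrink only halfway, which slows the contraction of the search range.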

In this study, design variables are encoded in a finite-length string of real numbers. 
Fitness of a design candidate is determined by its rank among the population based on 
its objective function value, and then selection is performed by stochastic universal 
sampling [7] coupled with the elitist strategy. Ranking selection is adopted since it 
maintains sufficient selection pressure throughout the optimization. One-point 
crossover is always applied to real-number strings of the selected design candidates. 
Mutation takes place at a probability of 0.1, and then a uniform random disturbance is 
added to the corresponding gene in the amount up to 0.1. 
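
The genetic operators described above can be sketched as follows. The linear rank-to-fitness mapping, the clamping of mutated genes to a small interval inside (0, 1), and all function names are assumptions made for illustration; the paper does not specify these details here.

```python
import random

def rank_fitness(objective_values):
    """Rank-based fitness for minimization: best gets n, worst gets 1 (assumed linear)."""
    order = sorted(range(len(objective_values)), key=lambda i: objective_values[i])
    fitness = [0] * len(objective_values)
    for rank, i in enumerate(order):
        fitness[i] = len(order) - rank
    return fitness

def stochastic_universal_sampling(fitness, n, rng):
    """Pick n indices with equally spaced pointers over the cumulative fitness."""
    step = sum(fitness) / n
    start = rng.uniform(0.0, step)
    chosen, i, cumulative = [], 0, fitness[0]
    for k in range(n):
        pointer = start + k * step
        while pointer > cumulative and i < len(fitness) - 1:
            i += 1
            cumulative += fitness[i]
        chosen.append(i)
    return chosen

def one_point_crossover(a, b, rng):
    """Swap the tails of two real-number strings at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(genes, rng, rate=0.1, amount=0.1, eps=1e-6):
    """Add a uniform disturbance of up to +/- amount to each gene with probability rate."""
    out = []
    for g in genes:
        if rng.random() < rate:
            g = min(1.0 - eps, max(eps, g + rng.uniform(-amount, amount)))
        out.append(g)
    return out
```

Stochastic universal sampling uses a single random offset for all pointers, so each candidate is selected a number of times that differs from its expected count by less than one.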



2.3 Test Problem Using a Multi-modal Function 



To demonstrate how the real-coded ARGA works, it was applied to minimization of a 
high dimensional multi-modal function: 

$$F1 = \sum_{i=1}^{20} \left( x_i^2 + 5(1 - \cos(x_i \cdot \pi)) \right) \qquad (9)$$

where $x_i \in [-3,3]$. This function has a global minimum at $x_i = 0$ and two local optima 
near $x_i = \pm 2$. In the real-coded ARGA, $x_i$ correspond to $p_i$ in Eq. (5). 150 generations 
were allowed with a population size of 300. Five trials were run for each GA, 
changing seeds for random numbers to give different initial populations. Figure 4 





Fig. 4 Comparison of convergence histories (horizontal axis: generation) 

Fig. 5 Comparison of convergence histories of $x_i$ between GA (above) and ARGA 
(below) 
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compares the performances of the conventional GA and the ARGA. Figure 5 plots all 
$x_i$'s from the temporary solutions, which helps to understand why the ARGA works 
better than the conventional GA. This figure shows that the ARGA maintains gene 
diversity longer than the conventional GA in the initial phase and then adapts its 
search space to the local region near the optimum. While the initial gene diversity 
contributes to the ARGA's robustness, the adaptive feature of the ARGA improves 
its local search capability. The ARGA also showed its advantages over a real-coded 
GA on a dynamic control problem and aerodynamic airfoil shape optimization [8]. 
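
For reference, the test function of Eq. (9) is easy to reproduce. The sketch below is a direct transcription of the formula; only the function name `f1` is an assumption.

```python
import math

def f1(x):
    """Multi-modal test function: sum of x_i**2 + 5 * (1 - cos(pi * x_i))."""
    return sum(xi * xi + 5.0 * (1.0 - math.cos(math.pi * xi)) for xi in x)

print(f1([0.0] * 20))  # global minimum
print(f1([2.0] * 20))  # all variables close to a local optimum
print(f1([1.0] * 20))  # the ridge separating the two
```

Each variable contributes 11 at $x_i = 1$ but only 4 at $x_i = 2$, so a search that starts beyond the ridge must climb over it to reach the global minimum at the origin.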



3 Aerodynamic Design of a Transonic Wing 



A wide range of approximations can represent the flow physics. Among them, the 
Navier-Stokes equations provide the state of the art in aerodynamic performance 
evaluation for engineering purposes. Although the three-dimensional Navier-Stokes 
calculation requires large computer resources to estimate wing performance within a 
reasonable time, it is necessary because the flow around a wing involves significant 
viscous effects, such as potential boundary-layer separations and shock 
wave/boundary layer interactions in the transonic regime. Here, a three-dimensional 
Reynolds-averaged Navier-Stokes solver [9] is used to guarantee an accurate model of 
the flow field and to demonstrate the feasibility of the present algorithm. 

The objective of the present wing design problem is maximization of lift-to-drag 
ratio L/D at the transonic cruise design point, maintaining the minimum wing 
thickness required for structural integrity against the bending moment due to the lift 
distribution. The cruising Mach number is set to 0.8. The Reynolds number based on 
the chord length at the wing root is assumed to 10^. 

In the present optimization, a planform shape of generic transport was selected as 
the test configuration (Fig. 6). Wing profiles of design candidates are generated by the 
PARSEC airfoils as briefly described in the next section. The PARSEC parameters 
and the sectional angle of attack (in other words, root incident angle and twist angle) 
are given at seven spanwise sections, of which spanwise locations are also treated as 
design variables except for the wing root and tip locations. The PARSEC parameters 
are rearranged from root to tip according to the airfoil thickness so that the resulting 
wings always have maximum thickness at the wing root. The twist angle parameter is 
also rearranged into numerical order from tip to root. The wing surface is then 
interpolated in spanwise direction by using the second-order Spline interpolation. 

In total, 87 parameters determine a wing geometry. Parameter ranges of the design 
space are shown in Table 1. It should be noted that in ARGAs, the user-defined design 
space is used just to seed the initial population. ARGA can extend the search 
beyond the initially defined design space. 

To estimate the thickness distribution required to withstand the bending moment due to 
the lift distribution, the wing is modeled by a thin-walled box beam as shown in Fig. 
6. The constraint for wing thickness $t$ is specified by using the minimum thickness 
$t_{min}$ calculated from the wing box sustaining the aerodynamic bending moment M as 



M 






( 10 ) 
where the following assumptions are made: the thickness of the skin panels is 2.5[cm] 
and its ultimate normal stress $\sigma_{ultimate}$ is 39[ksi]. The length of the chord at wing root c 
and maximum wingspan b/2 are 10[m] and 18.8[m], respectively (for the derivation 
of Eq. (10), see [10] for example). 




Fig. 6 Wing geometry definition. Planform shape is frozen during the optimization. Wing box 
is used to estimate its structural strength. PARSEC parameters are the design variables for 
airfoil shapes defined at seven spanwise sections 



Table 1 Parameter ranges of the design space. PARSEC is determined by leading-edge radius 
($r_{LE}$), upper and lower crest locations including curvatures ($X_{UP}$, $Z_{UP}$, $Z_{XXUP}$, $X_{LO}$, $Z_{LO}$, $Z_{XXLO}$), 
trailing-edge ordinate ($Z_{TE}$) and thickness ($\Delta Z_{TE}$) and direction and wedge angles ($\alpha_{TE}$, $\beta_{TE}$) 

parameters    $r_{LE}$   $Z_{TE}$   $\alpha_{TE}$   $\beta_{TE}$   $X_{UP}$   $Z_{UP}$   $Z_{XXUP}$   $X_{LO}$   $Z_{LO}$   $Z_{XXLO}$   twist angle 
upper bound   0.030      0.01       -3.0            8.0            0.7        0.18       0.0          0.6        0.02       0.9          7 deg 
lower bound   0.002      -0.01      -13.0           4.0            0.3        0.08       -0.3         0.2        -0.04      0.3          -1 deg 



3.1 PARSEC Airfoils 

An airfoil family “PARSEC” has been recently proposed to parameterize an airfoil 
shape [11]. A remarkable point is that this technique has been developed aiming to 
control important aerodynamic features effectively by selecting the design parameters 
based on the knowledge of transonic flows around an airfoil. 

Similar to 4-digit NACA series airfoils, the PARSEC parameterizes upper and 
lower airfoil surfaces using polynomials in coordinates X, Z as 

$$Z = \sum_{n=1}^{6} a_n X^{n - 1/2}$$

where $a_n$ are real coefficients. Instead of taking these coefficients as design parameters, the 
PARSEC airfoils are defined by basic geometric parameters: leading-edge radius 
($r_{LE}$), upper and lower crest locations including curvatures ($X_{UP}$, $Z_{UP}$, $Z_{XXUP}$, $X_{LO}$, $Z_{LO}$, 
$Z_{XXLO}$), trailing-edge ordinate ($Z_{TE}$), thickness ($\Delta Z_{TE}$) and direction and wedge angles 
($\alpha_{TE}$, $\beta_{TE}$) as shown in Fig. 6. These parameters can be expressed by the original 
coefficients a„ by solving simple simultaneous equations. Eleven design parameters 
are required for the PARSEC airfoils to define an airfoil shape in total. In the present 
case, the trailing-edge thickness is frozen to 0. Therefore, ten design variables are 
used to give each spanwise section of the wing. 
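
To make the parameterization concrete, the sketch below evaluates one surface of such an airfoil from its six coefficients, using the half-integer power series commonly given for PARSEC. The coefficient values in the example are arbitrary placeholders, not a designed airfoil, and the helper names are illustrative. The relation between the first coefficient and the leading-edge radius follows from the curvature of $Z = a_1 \sqrt{X}$ at the nose.

```python
import math

def parsec_surface(coeffs, x):
    """Z(X) = sum over n = 1..6 of a_n * X**(n - 1/2), for 0 <= X <= 1."""
    return sum(a * x ** (n - 0.5) for n, a in enumerate(coeffs, start=1))

def first_coefficient(r_le, upper=True):
    """Near the nose Z ~ a_1 * sqrt(X), whose radius of curvature is a_1**2 / 2."""
    a1 = math.sqrt(2.0 * r_le)
    return a1 if upper else -a1

# Placeholder coefficients for an upper surface (illustrative values only).
upper = [first_coefficient(0.02), 0.1, -0.3, 0.0, 0.0, 0.0]
profile = [(x / 10, parsec_surface(upper, x / 10)) for x in range(11)]
```

In practice the remaining coefficients are obtained by solving the small linear system that imposes the crest location, crest curvature and trailing-edge conditions, as the text notes.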



3.2 Optimization Using Real-Coded ARGA 



Because the objective function distribution of the present optimization is likely to be 
more complex than the above test function minimization, the relaxation factor is 
now set to 0.3. The structured coding coupled with one-point crossover proposed in 
[12] is also incorporated. The present ARGA adopts the elitist strategy where the best 
and the second best individuals in each generation are transferred into the next 
generation without any crossover or mutation. The parental selection consists of the 
stochastic universal sampling and the ranking method using Michalewicz’s nonlinear 
function. Mutation takes place at a probability of 10% and then adds a random 
disturbance to the corresponding gene in the amount up to ± 10% of each parameter 
range in Table 1. The population size is kept at 64 and the maximum number of 
generations is set to 65 (based on the CPU time allowed). The initial population is 
generated randomly over the entire design space. 

The main concern related to the use of GAs coupled with a three-dimensional 
Navier-Stokes solver for aerodynamic designs is the computational cost required. In 
the present case, each CFD evaluation takes about 100 min. of CPU time even on a 
vector computer. Because the present optimization evaluates 64 x 65 = 4160 design 
candidates, sequential evolutions would take almost 7000 h (more than nine months!). 

Fortunately, parallel vector computers are now available at several institutions and 
universities. In addition, GAs are intrinsically parallel algorithms and can be easily 
parallelized. One such computer is the Numerical Wind Tunnel (NWT) located at 
National Aerospace Laboratory in Japan. NWT is a MIMD parallel computer with 
166 vector-processing elements (PEs); its total peak performance and total 
main memory capacity are about 280 GFLOPS and 45GB, respectively. For more 
detail, see [13]. In the present optimization, the evaluation process at each generation was 
parallelized using the master-slave concept. This reduced the corresponding turnaround 
time to almost 1/64 because the CPU time used for the GA operators is negligible. 
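
The master-slave scheme and its effect on turnaround time can be sketched as follows. A thread pool stands in for the processing elements, and `evaluate` is a hypothetical placeholder for one flow evaluation; neither reflects the actual NWT implementation. The timing estimate assumes that every evaluation takes the same time and that the GA operators cost nothing.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(candidate):
    """Placeholder for one expensive flow evaluation (hypothetical stand-in)."""
    return -sum(g * g for g in candidate)

def evaluate_generation(population, workers):
    """Master hands one candidate to each worker and gathers results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate, population))

def turnaround_hours(pop_size, generations, minutes_per_eval, workers):
    """Wall-clock estimate: each generation needs ceil(pop_size / workers) waves."""
    waves = -(-pop_size // workers)
    return generations * waves * minutes_per_eval / 60

print(turnaround_hours(64, 65, 100, 1))   # sequential: about 6933 h
print(turnaround_hours(64, 65, 100, 64))  # one candidate per PE: about 108 h
```

With 64 candidates and 64 workers, every generation completes in the time of a single evaluation, which is where the factor of almost 1/64 comes from.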

To handle the structural constraint with the single-objective GA, the constrained 
optimization problem was transformed into an unconstrained problem as 

$$\text{fitness function} = \begin{cases} 100 + L/D & \text{if } t \ge t_{min} \\ (100 + L/D) \cdot \exp(t - t_{min}) & \text{otherwise} \end{cases} \qquad (11)$$

where $t$ and $t_{min}$ are thickness and minimum thickness at the span station of the 
maximum local stress. The exponential term penalizes the infeasible solutions by 
reducing the fitness function value. Because some design candidates can have 
negative L/D, the summation of 100 and L/D is used. 
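
The penalized fitness can be sketched directly from its description. The function and argument names are illustrative, and the units in which the two thicknesses are expressed inside the exponential are not stated in this section, so the numbers below only show the qualitative behaviour.

```python
import math

def fitness(lift_to_drag, t, t_min):
    """Fitness with an exponential penalty when the thickness constraint is violated."""
    base = 100.0 + lift_to_drag  # the offset keeps the value positive for negative L/D
    if t >= t_min:
        return base
    return base * math.exp(t - t_min)  # factor below 1 for infeasible designs

feasible = fitness(18.91, t=0.030, t_min=0.025)
infeasible = fitness(18.91, t=0.020, t_min=0.025)
```

Because the penalty factor approaches 1 as the thickness approaches its minimum, slightly infeasible designs are only mildly penalized and can still contribute genetic material near the constraint boundary.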
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3.3 Results 




Fig. 7 Optimization history 



The optimization history of the present ARGA is shown in Fig. 7 in terms of L/D. 
During the initial phase of the optimization, some members had a strong shock wave 
or failed to satisfy the structural constraint. However, they were weeded out from the 
population because of the resultant penalties to the fitness function. The final 
design has an L/D of 18.91 ($C_L$ = 0.26213 and $C_D$ = 0.01386), satisfying the given 
structural constraint. Turnaround time of this optimization was about 108 h 
on NWT. 

To examine whether the present optimal design is close to a global optimum, we have 
checked it against analytically and empirically established design guidelines. In 
aerodynamics, spanwise lift distribution should be elliptic to minimize the induced drag. 
However, the structural constraint leads to a tradeoff between induced drag and wave 
drag. This forces the spanwise lift distribution to be linear rather than elliptic. The 
present solution does have a linear distribution. To produce this distribution, a wing is 
usually twisted by about five degrees. The present wing is twisted by six degrees. 

Figure 8 shows the designed airfoil sections and the corresponding pressure 
distributions at the 0, 33, and 66% spanwise locations. In the pressure distributions, 
neither any strong shock wave nor any flow separation is found. This ensures that the 
present wing has very little wave drag and pressure drag. At 33 and 66% spanwise 
locations, the rooftop, front-loading and rear loading patterns are observed, which are 
typical for the supercritical airfoils [14] used for advanced transport today. The 
corresponding airfoil shapes are indeed similar to supercritical airfoils. Overall, these 
detailed observations of the design confirm that the present design is very close to a 
global optimum expected by the present knowledge in aerodynamics. 




Fig. 8 Designed airfoil sections and corresponding pressure distributions 



4 Summary 

To develop GAs applicable to practical aerodynamic shape designs, the real-coded 
ARGAs have been developed by incorporating the idea of the binary-coded ARGAs 
with the use of the floating-point representation. The real-coded ARGA has been 
applied to a practical aerodynamic design optimization of a transonic wing shape for 
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generic transport as well as a simple test case. The test case result confirms the 
present GA outperforms the conventional GA. 

Aerodynamic optimization was performed with 87 real-number design variables by 
using the Navier-Stokes code. A realistic structural constraint was imposed. The 
resulting wing appears very similar to advanced wing designs based on supercritical 
airfoils. The straight span load distribution of the resulting design represents a 
compromise between minimizing induced drag and minimizing wave drag. The 
designed wing also has a fully attached flow and the allowable minimum thickness, so 
that pressure drag and wave drag are minimized under the present structural 
constraint. These results confirm the feasibility of the present approach for future 
applications. 



References 

1. Michalewicz, Z., Genetic Algorithms + Data Structures = Evolution Programs, third 
revised edition, Springer-Verlag, (1996). 

2. Janikow, C. Z. and Michalewicz, Z., An Experimental Comparison of Binary and Floating 
Point Representations in Genetic Algorithms, Proc. of the 4th Intl. Conference on Genetic 
Algorithms, (1991), pp. 31-36. 

3. Arakawa, M. and Hagiwara, I., Development of Adaptive Real Range (ARRange) Genetic 
Algorithms, JSME Intl. J., Series C, Vol. 41, No. 4 (1998), pp. 969-977. 

4. Arakawa, M. and Hagiwara, I., Nonlinear Integer, Discrete and Continuous Optimization 
Using Adaptive Range Genetic Algorithms, Proc. of 1997 ASME Design Engineering 
Technical Conferences, (1997). 

5. Krishnakumar, K., Swaminathan, R., Garg, S. and Narayanaswamy, S., Solving Large 
Parameter Optimization Problems Using Genetic Algorithms, Proc. of the Guidance, 
Navigation, and Control Conference, (1995), pp. 449-460. 

6. Mulgund, S., Harper, K., Krishnakumar, K. and Zacharias, G., Air Combat Tactics 
Optimization Using Stochastic Genetic Algorithms, Proc. of 1998 IEEE Intl. Conference 
on Systems, Man, and Cybernetics, (1998), pp. 3136-3141. 

7. Baker, J. E., Reducing Bias and Inefficiency in the Selection Algorithm, Proc. of the 2nd 
Intl. Conference on Genetic Algorithms, (1987), pp. 14-21. 

8. Oyama, A., Obayashi, S. and Nakahashi, K., Wing Design Using Real-Coded Adaptive 
Range Genetic Algorithm, Proc. of 1999 IEEE Intl. Conference on Systems, Man, and 
Cybernetics [CD-ROM], (1999). 

9. Obayashi, S. and Guruswamy, G. P., Convergence Acceleration of an Aeroelastic Navier- 
Stokes Solver, AIAA Journal, Vol. 33, No. 6, (1995), pp. 1134-1141. 

10. Case, J., Chilver, A. H. and Ross, C. T. F., Strength of Materials & Structures with an 
Introduction to Finite Element Methods, 3rd Edn., Edward Arnold, London, (1993). 

11. Sobieczky, H., Parametric Airfoils and Wings, Recent Development of Aerodynamic 
Design Methodologies - Inverse Design and Optimization -, Friedr. Vieweg & Sohn 
Verlagsgesellschaft mbH, Braunschweig/Wiesbaden, (1999), pp. 72-74. 

12. Oyama, A., Obayashi, S., Nakahashi, K. and Hirose, N., Fractional Factorial Design of 
Genetic Coding for Aerodynamic Optimization, AIAA Paper 99-3298, (1999). 

13. Nakamura, T., Iwamiya, T., Yoshida, M., Matsuo, Y. and Fukuda, M., Simulation of the 3 
Dimensional Cascade Flow with Numerical Wind Tunnel (NWT), Proc. of the 1996 
ACM/IEEE Supercomputing Conference [CD-ROM], (1996). 

14. Harris, C. D., NASA Supercritical Airfoils, NASA TP 2969, (1990). 




A Common CFD Platform UPACS 



Hiroyuki Yamazaki^1, Shunji Enomoto^2, and Kazuomi Yamamoto^3 

^1 Computational Science Division, National Aerospace Laboratory, 
7-44-1 Chofu, Tokyo, Japan 
yamazaki@nal.go.jp 
^2 Aeroengines Division, National Aerospace Laboratory, 
7-44-1 Chofu, Tokyo, Japan 
eno@nal.go.jp 
^3 Aeroengines Division, National Aerospace Laboratory, 
7-44-1 Chofu, Tokyo, Japan 
kazuomi@nal.go.jp 

Abstract. NAL (National Aerospace Laboratory) is developing a 
common parallel CFD platform, UPACS (Unified Platform for 
Aerospace Computational Simulation), for efficient CFD programming 
and the aggregation of NAL's CFD technology. UPACS is based on the 
three-dimensional Navier-Stokes equations and supports multi-block 
grids. It is parallelized with the MPI message passing library. In this 
paper the concept and the structure of UPACS are described and its 
computational performance is evaluated on the NAL NWT (Numerical 
Wind Tunnel) supercomputer. 



1 Introduction 

This paper introduces UPACS (Unified Platform for Aerospace Computational 
Simulation), which NAL is developing for efficient CFD programming and the 
aggregation of NAL's CFD technology. The parallel and vector performance of 
the current version is evaluated on the NAL NWT vector parallel supercomputer 
with simple test cases. 

The progress in CFD and parallel computers during the 1990s enabled massive 
numerical simulations of flow around realistic, complicated aircraft configurations, 
direct optimization of aerodynamic design including structural analysis or 
heat transfer, complicated unsteady flow in jet engines, and so on. However, the 
increased complexity of computer programs due to the adaptation to complex 
configurations, parallel computation, and multi-physics couplings 
brought various problems. One of them is the difficulty of sharing analysis codes 
and know-how among CFD researchers, because they tend to make their own 
variations of the programs for their own purposes. Parallel programming is 
another problem: it makes programs much more complicated, and 
portability is sometimes lost. 
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In order to overcome the difficulties in such complicated programming and to 
accelerate the code development, NAL started a pilot project, UPACS (Unified 
Platform for Aerospace Computational Simulation), in 1998. UPACS is expected 
to free CFD researchers from parallel programming, but the biggest advantage 
would be that both CFD users and code developers can easily share various 
simulation codes through the UPACS common base. 



2 Concept of the UPACS Code 

The following design concepts have been determined. 

(1) Multi-block methods: Considering the adaptation to complex configurations, 
multi-block structured grid methods have been chosen as the first step. The 
multi-block approach is easily applied to parallel computing. Extensions to 
unstructured grid methods and the overset grid method are also under consideration. 

(2) Separation of multi-block multi-processor procedures from solver routines: 
The parallel computation and multi-block data control processes are clearly 
separated from the CFD solver modules so that the solver can be modified 
without considering the parallel process and the multi-block data handling as if 
the solver module itself is only for single block problems. 

(3) Portability: The parallelization based on domain-decomposition using the 
message passing interface, MPI, is used to minimize the dependency on hardware 
architectures. 

(4) Structure and encapsulation: The hierarchical structure of the program and data 
is clearly defined and the modules are encapsulated so that the code can be 
shared and modified more easily among CFD researchers and developers. 

UPACS is written in Fortran 90 and runs on workstations and NWT. Since 
the original version of the program is not vectorized by the NWT compiler, an 
NWT vector version of the code is provided. The vector version differs little 
from the original one: compiler directives are added and the 
inner DO loops are vectorized. 



3 Basic Structure of the Code 

3.1 Hierarchical Structure 

One of the key features in the UPACS code design is the hierarchical structure 
(Fig. 1). This structure consists of three layers: 
Fig. 1. Hierarchical structure of UPACS 



(1) Bottom layer: Single-block level 

Subroutines in this layer perform the calculations inside the block and apply its 
boundary conditions. Data of the neighboring blocks are already prepared as boundary 
values by the intermediate layer. Thus the routines in this layer can be easily programmed 
and replaced with other numerical solvers. 

(2) Intermediate layer: Multi-block level 

The data transfer between blocks and the distribution of blocks to processors 
are handled in this layer. These procedures are independent of the calculations inside 
each block; therefore, the intermediate layer can be generalized and used like a library. 
Users can freely modify solvers in the bottom layer without touching the parallel 
procedures, which are controlled in the intermediate layer. 

(3) Top layer: Main loop level 

The top layer is the main routine, which determines the framework of the
iteration algorithm. It depends on the solution method or numerical model, so
users will typically want to modify it. Some of the calculation procedures
related to parallel processing may also be defined in this layer.

The flow of the program is described as follows: 

Initial process: Grids are read and metrics are calculated. Each grid is
extended outward by two layers of cells. These extra cells are used for
exchanging values with neighboring grids, or for wall or entrance/exit
boundaries.

Iteration: Two main subroutines, ‘updateGhostPhys’ and ‘timeintegration’, are
called here. The updateGhostPhys subroutine updates the extended cells of the
grids. At a wall boundary, the values of the extended cells are set so that the
boundary condition is satisfied.



3.2 Calculation

UPACS solves the three-dimensional Navier-Stokes equations.

The equations are discretized by the finite volume method. Each cell is a
hexahedron with a finite volume, and the physical variable Q is defined at the
center of the cell (cell-centered grid).

The convective term is calculated with the Roe scheme or the AUSMDV scheme.
The numerical flux on the cell boundary is evaluated with the MUSCL scheme.

The viscous term is calculated with a second-order central difference
discretization.



3.3 Parallelization and Boundary Treatment 

The information on connected grids and boundary conditions is unified by
the concept of a “window”. A window is a two-dimensional region which covers the
surfaces of a grid. This information is given to the solver code by a text file
which describes the geometry of the windows on the grid and the type of each
window (boundary, or region connected to another grid).

The values at a boundary connected to a neighboring grid are set by
communication. The timeintegration subroutine calculates the values at the next
time step inside the grids.

The basic communication strategy is described as follows: 

(1) A grid sends data to the connected grid asynchronously.

(2) It receives data from the connected grid, then transforms the received data
and sets it in the extended area of the grid. The transformation of the data is
the receiver’s task. It is necessary because a grid is not generally connected
in the same i, j, k directions as the neighboring grids.

(3) It waits for the completion of the asynchronous send. This is necessary
because the data being sent asynchronously must be kept unchanged until the
send has finished.
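
The receiver-side transformation in step (2) can be illustrated with a small
sketch. The fragment below is not UPACS code (UPACS is written in Fortran 90).
It assumes that a received face arrives as a two-dimensional array in the
sender's index order and shows one way a receiver might re-index it before
copying it into its extended cells. The function name transform_face and its
flags are hypothetical.

```python
def transform_face(face, swap_axes=False, flip_first=False, flip_second=False):
    """Re-index a 2D face from the sender's index order to the receiver's.

    face is a list of rows. The flags (hypothetical names) describe how the
    receiver's two face directions relate to the sender's.
    """
    rows = [list(row) for row in face]      # copy, so the send buffer is untouched
    if swap_axes:                           # receiver's first direction is sender's second
        rows = [list(col) for col in zip(*rows)]
    if flip_first:                          # first direction runs the opposite way
        rows.reverse()
    if flip_second:                         # second direction runs the opposite way
        rows = [row[::-1] for row in rows]
    return rows


# Face as sent: 2 x 3 values in the sender's index order.
sent = [[1, 2, 3],
        [4, 5, 6]]

# Assumed connection: the receiver's axes are swapped and its first axis is reversed.
received = transform_face(sent, swap_axes=True, flip_first=True)
print(received)
```

Because the function copies its input, the send buffer stays unchanged, which
matches the requirement in step (3) that data being sent must not be modified
before the send completes.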

Data in the face, edge, and corner cells are each transferred (Fig. 2).
Before the received data are set, the edge cells and the corner cells are
extrapolated from the values of the face cells and the edge cells. These
extrapolated values are usually overwritten by the received data.




186 Hiroyuki Yamazaki et al. 




3.4 Current Status of Development 

The UPACS code is currently under development through discussions on the
detailed design and validation of the CFD solver and the multi-block
multi-processor procedures. Figures 3 and 4 show an example of inviscid
calculations around the experimental SST (supersonic transport) with 105 blocks.

Because UPACS is under intensive development, the modifications made to provide
the vectorized version are intentionally kept small for now, and the code is
vectorized in one direction.




Fig. 3. Grids of the Experimental SST Model



4 Specification of NWT

NWT (Numerical Wind Tunnel) is a parallel computer consisting of 166 PEs
(processor elements) connected by a crossbar network. Each PE is a vector
processor with pipelines of multiplicity 8. The add, multiply, load and store
pipelines can operate simultaneously. A scalar instruction is a Long Instruction
Word (LIW), which issues 3 scalar instructions or 2 floating-point instructions
at once. Each PE has a main memory of 256 MB, except for 4 PEs with 1 GB. The
crossbar network connects any pair of PEs exclusively, without interference from
other PEs. The total peak performance is 280 GFLOPS and the total capacity of
the main memory is 44.5 GB.
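
As a quick consistency check of the figures quoted above, the following Python
sketch recomputes the total memory and the implied peak performance per PE. It
assumes 1 GB = 1024 MB; the per-PE peak is derived here from the stated totals
and is not a number given in the text.

```python
N_PES = 166            # total processor elements
N_LARGE_PES = 4        # PEs with 1 GB of main memory
SMALL_MB = 256         # main memory of each remaining PE
LARGE_MB = 1024        # assumption: 1 GB = 1024 MB
PEAK_GFLOPS = 280.0    # stated total peak performance


def total_memory_gb():
    """Total main memory in GB, summed over small and large PEs."""
    total_mb = (N_PES - N_LARGE_PES) * SMALL_MB + N_LARGE_PES * LARGE_MB
    return total_mb / 1024


def peak_per_pe_gflops():
    """Peak performance per PE implied by the stated total."""
    return PEAK_GFLOPS / N_PES


print(total_memory_gb())
print(round(peak_per_pe_gflops(), 2))
```

Under these assumptions the memory total reproduces the 44.5 GB stated in the
text.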
5 Performance Results



5.1 Test Cases 

The computational performance of the current version of UPACS is evaluated on
NWT. The test cases are as follows:

(1) Single grid/single PE performance: A single-PE calculation with one grid
of 64x64x64 cells. This is the largest grid that can be calculated by one
256 MB PE of NWT.

(2) Parallel performance: Parallel calculations with 2, 2x2, and 2x4 connected
grids. Each grid has 64x64x64 cells, and the grids are connected one- or
two-dimensionally. Each PE has one grid.



5.2 Results 

(1) Single grid/single PE: 

The results of this case are shown in Table 1. In the current version the
innermost DO loops are vectorized, which means that the loops are vectorized in
one index direction. If a grid is configured so that the number of cells in that
direction is large, the vector performance improves (240x30x30 case).



Table 1: Performance of single PE cases on NWT

Grid size    Equation        Time [s]          Performance
                             (one iteration)   [MFLOPS]
64x64x64     Euler           1.452             280.0
64x64x64     Navier-Stokes   2.573             317.6
240x30x30    Navier-Stokes   1.595             426.2



(2) Parallel performance:

Table 2 shows the parallel performance, based on the Navier-Stokes equations.
The size of each grid is constant (64x64x64), so the table represents
scalability performance. The performance ratio is the parallel performance
relative to the single PE.

The calculation times of the cell values are almost constant, and the
differences in the iteration times between cases come from differences in the
transfer times in the updateGhostPhys routine. The transfer time is affected by
the number of surfaces connected to other grids. In the 4x4 case, the inner four
grids take more time than the outer grids: the inner grids have four surfaces on
which to send or receive data, while the outer ones have two or three. In the
4x8 and 8x8 cases, grids have at most four surfaces to communicate, and the
transfer time is almost the same as in the 4x4 case.

Table 2. Parallel performance on NWT

Number of PEs   Time [s]          Time [s]          Performance ratio
                (one iteration)   UpdateGhostPhys
1               2.573             0.063             1
2x1             2.792             0.288             1.84
2x2             2.836             0.325             3.63
2x4             2.870             0.359             7.17
4x4             3.091             0.581             13.32
4x8             3.099             0.589             26.57
8x8             3.105             0.595             53.03
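
The performance ratios in Table 2 appear to be consistent with a definition in
which each PE holds the same amount of work, so that the ratio is N * T(1) / T(N)
for N PEs. This formula is inferred from the tabulated numbers and is not stated
explicitly in the text. The following Python sketch recomputes the ratios from
the iteration times; the function names are illustrative.

```python
def performance_ratio(n_pes, t_single, t_parallel):
    """Weak-scaling ratio: n_pes times the work is done in t_parallel seconds."""
    return n_pes * t_single / t_parallel


def parallel_efficiency(n_pes, t_single, t_parallel):
    """Fraction of ideal scaling achieved (1.0 would be perfect)."""
    return performance_ratio(n_pes, t_single, t_parallel) / n_pes


# Iteration times [s] from Table 2, keyed by the total number of PEs.
TABLE2_TIMES = {1: 2.573, 2: 2.792, 4: 2.836, 8: 2.870,
                16: 3.091, 32: 3.099, 64: 3.105}

ratios = {n: round(performance_ratio(n, TABLE2_TIMES[1], t), 2)
          for n, t in TABLE2_TIMES.items()}
for n, ratio in ratios.items():
    print(n, ratio)
```

With these inputs the computed values agree with the last column of Table 2 to
two decimal places.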



Table 3 shows the parallel performance on a Sun Enterprise 4500 workstation.
The calculation is based on the Navier-Stokes equations. The size of each grid
is constant (64x64x64). The original code is tested; no performance tuning has
been done for the workstation.



Table 3. Parallel performance on Enterprise 4500

Number of PEs   Time [s]          Time [s]          Performance ratio
                (one iteration)   UpdateGhostPhys
1               157.36            0.339             1
2x1             159.14            1.17              1.98
2x2             159.28            1.451             3.95
2x4             159.39            1.762             7.90



The differences in iteration times between cases come from the UpdateGhostPhys
routine, which sets values in the extended cells of the grids according to the
boundary conditions or data transfer (in the parallel case).

The time charts for data transfer in the 2x2 PE case, created by the VAMPIR
analysis tool, are shown in Fig. 5. In the chart, the light-colored bands show
time spent in the user code and the dark ones time spent in the MPI library.
The oblique lines represent actual communication by the MPI library. The figure
shows that most of the time is spent in the transformation of received data and
in the extrapolation processes. These processes have not yet been modified to
improve performance.

Fig. 5. Transfer times of the 2x2 PE case on NWT



6 Conclusion 

A common CFD platform, UPACS, which is under development at NAL, is introduced
and its concept is described. It supports multi-block grids and uses the message
passing library MPI for portability. The basic structure of the code is
explained: the code has a three-layer structure to realize the concepts of
multi-block handling, encapsulation and so on.

The parallel and vector performance of the code is evaluated on NWT with
simple cases. The results show good performance, though there is room for
further performance improvement.



Abstract. Multiprocessor systems are increasingly being used to handle
large-scale scientific applications that demand high performance. However,
performance analysis is not as mature for multiprocessor systems as
for uniprocessor systems, and improved ways of automatic performance 
analysis are needed to reduce the cost and complexity of developing dis- 
tributed/parallel applications. 

Performance analysis is commonly a cyclic process of measuring and 
analyzing performance data, identifying and possibly eliminating per- 
formance bottlenecks in slow progression. Currently this process is con- 
trolled manually by the programmer. We believe that the implicit knowl- 
edge applied in this cyclic process should be formalized in order to pro- 
vide automatic performance analysis for a wider class of programming 
paradigms and target architectures. This article describes the perfor- 
mance property specification language (ASL) developed in the APART 
Esprit IV working group which allows specifying performance-related 
data by an object-oriented model and performance properties by func- 
tions and constraints defined over performance-related data. Perfor- 
mance problems and bottlenecks can then be identified based on user- or 
tool-defined thresholds. In order to demonstrate the usefulness of ASL we 
apply it to HPF (High Performance Fortran) by successfully formalizing 
several HPF performance properties. 



1 Introduction 

Although rapid advances in processor design and communication infrastructure
are bringing teraflops performance within grasp, the software infrastructure for
multiprocessor systems simply has not kept pace. The lack of useful, accurate
performance analysis is particularly distressing, since performance is a key issue
of multiprocessor systems.

* The ESPRIT IV Working Group on Automatic Performance Analysis: Resources
and Tools is funded under Contract No. 29488

Despite the existence of a large number of tools assisting the programmer 
in performance experimentation, it is still the programmer’s responsibility to 
take most strategic decisions. Many performance tools are platform and lan- 
guage dependent, cannot correlate performance data gathered at a lower level 
with higher-level programming paradigms, focus only on specific program and 
machine behavior, and do not provide sufficient support to infer important per- 
formance properties. 

In this article we describe a novel approach to formalize performance bottle- 
necks and the data required in detecting those bottlenecks with the aim to sup- 
port automatic performance analysis for a wider class of programming paradigms 
and architectures. This research is done as part of APART Esprit IV Working 
Group on Automatic Performance Analysis: Resources and Tools [2]. In the re- 
mainder of this article we use the following terminology: 

Performance-related Data: Performance-related data defines information 
that can be used to describe performance properties of a program. There 
are two classes of performance related data. First, static data specifies in- 
formation that can be determined without executing a program on a target 
machine. Examples include program regions, control and data flow informa- 
tion, and predicted performance data. Second, dynamic performance-related 
data describes the dynamic behavior of a program during execution on a 
target machine. This includes timing events, performance summaries and 
metrics, etc. 

Performance Property: A performance property (e.g. load imbalance, com- 
munication, cache misses, etc.) characterizes a specific performance behavior 
of a program and can be checked by a set of conditions. Conditions are as- 
sociated with a confidence value (between 0 and 1) indicating the degree of 
confidence about the existence of a performance property. A confidence value 
0 means that the condition is most likely never true whereas a confidence 
value 1 implies that the condition presumably always holds. In addition, 
for every performance property a severity figure is provided that specifies 
the importance of the property. The higher this figure the more important 
or severe a performance property is. The severity can be used to concen- 
trate first on the most severe performance property during the performance 
tuning process. Performance properties, confidence, and severity are defined 
over performance-related data. 

Performance Problem: A performance property is a performance problem
if and only if its severity is greater than a user- or tool-defined threshold.

Performance Bottleneck: A program can have one or several performance
bottlenecks, which are characterized by having the highest severity figure. If
these bottlenecks are not a performance problem, then the program’s
performance is acceptable and does not need any further tuning.

On Performance Modeling for HPF Applications with ASL

For example, a code region may be examined for the existence of a
communication performance property. The condition for this property holds if any
process executing this region invokes communication (communication time is
larger than zero). The confidence value is 1 because measured communication
time represents a proof for this property. In contrast, if prediction were used
to reflect communication, then the confidence value is likely to be less than 1.
The severity is given by the percentage of communication time relative to the
execution time of the entire program. If the severity is above a user- or
tool-defined threshold, then the communication performance property defines a
performance problem. If this performance problem is the most severe one, then it
denotes the performance bottleneck of a program. If there are several
performance properties with identical highest severity figure, then all of them
are performance bottlenecks. Commonly, a programmer may try to eliminate or at
least to alleviate the bottlenecks before examining any other performance
problems.
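
The terminology above can be summarized in a short sketch. The following Python
fragment is only an illustration of the definitions, not part of ASL or of any
APART tool; the class PropertyInstance, the function names, and the sample
severities are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class PropertyInstance:
    name: str
    holds: bool         # result of evaluating the property's condition
    confidence: float   # between 0 and 1
    severity: float     # higher means more important


def performance_problems(props, threshold):
    """Properties that hold and whose severity exceeds the threshold."""
    return [p for p in props if p.holds and p.severity > threshold]


def bottlenecks(props):
    """All holding properties that share the highest severity figure."""
    held = [p for p in props if p.holds]
    if not held:
        return []
    top = max(p.severity for p in held)
    return [p for p in held if p.severity == top]


def needs_tuning(props, threshold):
    """Tuning is needed only if a bottleneck is also a performance problem."""
    return any(b.severity > threshold for b in bottlenecks(props))


props = [
    PropertyInstance("communication", True, 1.0, 0.30),
    PropertyInstance("load imbalance", True, 0.8, 0.30),
    PropertyInstance("cache misses", True, 1.0, 0.05),
]
print([p.name for p in performance_problems(props, 0.1)])
print([p.name for p in bottlenecks(props)])
print(needs_tuning(props, 0.1), needs_tuning(props, 0.5))
```

In this example two properties share the highest severity, so both are reported
as bottlenecks, and raising the threshold above that severity makes the
program's performance count as acceptable.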

In this paper we introduce the APART Specification Language (ASL) which 
allows the description of performance-related data through the provision of an 
object-oriented specification model and which supports the definition of perfor- 
mance properties in a novel formal notation. Our object-oriented specification 
model is used to declare - without the need to compute - performance infor- 
mation. It is similar to Java, uses only single inheritance and does not require 
methods. A novel syntax has been introduced to specify performance properties. 
The objective of ASL is to support performance modeling for a variety of pro- 
gramming paradigms including message passing, shared memory, and distributed 
memory programming. In this paper we demonstrate the usefulness of ASL for 
HPF ([8] - High Performance Fortran) by successfully formalizing several HPF 
performance properties. ASL has also been applied to OpenMP and message
passing programs, details of which can be found in [4].

The organization of this article is as follows. Related work is presented in 
Section 2. ASL constructs for specifying performance-related data are presented 
in Section 3. This includes the base classes which are programming paradigm 
independent and classes to specify performance-related data that are specific 
to HPF programs. The syntax for the specification of performance properties 
is described in Section 4. Examples for HPF property specifications including 
HPF code excerpts are presented in Section 5. Conclusions and Future work are 
discussed in Section 6. 



2 Related Work 

The use of specification languages in the context of automatic performance anal- 
ysis tools is a new approach. Paradyn [9] performs an automatic online analysis 
and is based on dynamic monitoring. While the underlying metrics can be defined 
via the Metric Description Language (MDL) [12], the set of searched bottlenecks 
is fixed. It includes CPUbound, ExcessiveSyncWaitingTime,
ExcessiveIOBlockingTime, and TooManySmallIOOps.

A rule-based specification of performance bottlenecks and of the analysis
process was developed for the performance analysis tool OPAL [7] in the
SVM-Fortran project. The rule base consists of a set of parameterized hypotheses
with proof rules and refinement rules. The proof rules determine whether a
hypothesis is valid based on the measured performance data. The refinement rules
specify which new hypotheses are generated from a proven hypothesis.

[Figure: UML class diagram relating the classes Application, Version,
SourceFile, Experiment, Machine, Region, RegionSummary, and Event]

Fig. 1. Base classes of performance-related data models

Another approach is to define a performance bottleneck as an event pattern in 
program traces. EARL [15] describes event patterns in a more procedural fashion 
as scripts in a high-level event trace analysis language which is implemented as 
an extension of common scripting languages like Tcl, Perl or Python. 

The languages presented here served as a starting point for the definition
of ASL but are too limited. MDL does not allow access to static program
information or the integration of information from multiple performance tools;
it is specially designed for Paradyn. The design of OPAL focuses more on the
formalization of the analysis process, and EDL and EARL are limited to pattern
matching in performance traces.

Some other performance analysis and optimization tools apply automatic 
techniques without being based on a special bottleneck specification, such as 
KAPPA-PI [1], FINESSE [11], and the online tuning system Autopilot [13].

3 Performance-Related Data Specification 

In the following we describe a library of classes that represent static and
dynamic information for performance bottleneck analysis. We distinguish between
two sets of classes: first, the set of base classes, which is independent of any
programming paradigm, and second, programming-paradigm-dependent classes,
which are shown for HPF.

3.1 Standard Class Library

Note that we expect most data models described with this language will have a 
similar overall structure. This similarity is captured in the base classes. Future 
data models can build specialized classes in form of subclasses. 

Figure 1 shows the UML [14] representation of the base classes which are 
programming paradigm independent. The translation of the UML diagrams into 
the specified syntax is straightforward. Initially, there is an application for which 
performance analysis has to be done. Every application has a name and may 
possibly have a number of implementations, each with a unique version number. 
Versions may differ with respect to their source files and experiments. Every
source file (the contents of which is stored in a generic string) has one or
several static code regions, each of which is uniquely specified by start_pos
(position where the region begins in the source file) and end_pos (position
where the region ends in the source file). A position in a region is defined by
a line and column number with respect to the given source file. Experiments,
denoting the second attribute of a version, are described by the time
(start_time) when the experiment started and the number of processors
(nr_processors) that were available to
execute the version. Furthermore, an experiment is also associated with a static
description of the machine (e.g. number of processors available) that is used for 
the experiment. Every experiment includes also dynamic data, i.e. a set of re- 
gion summaries (profile) and a set of events (trace). The class RegionSummary 
describes performance information across all processes employed for the exper- 
iment. Region summaries are associated with the appropriate region. The class 
Event represents information about individual events occurring at runtime, such 
as sending a message to another process. Each event has a timestamp attribute 
determining when the event occurred and a process attribute determining in 
which process the event occurred. 
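
To make the relationships among the base classes concrete, the following Python
sketch mirrors the description above using dataclasses. It is an illustrative
transcription, not ASL syntax: the Position fields, the use of an integer for
the process of an event, and the sample values are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Position:
    line: int
    column: int


@dataclass
class Region:
    start_pos: Position
    end_pos: Position
    sub_regions: List["Region"] = field(default_factory=list)


@dataclass
class SourceFile:
    name: str
    contents: str
    regions: List[Region] = field(default_factory=list)


@dataclass
class Machine:
    nr_processors: int          # processors available on the machine


@dataclass
class RegionSummary:
    region: Region              # the region this summary belongs to


@dataclass
class Event:
    timestamp: float
    process: int                # simplification: a process id instead of a Process object


@dataclass
class Experiment:
    start_time: datetime
    nr_processors: int          # processors used for this run
    system: Machine
    profile: List[RegionSummary] = field(default_factory=list)
    trace: List[Event] = field(default_factory=list)


@dataclass
class Version:
    version_no: int
    files: List[SourceFile] = field(default_factory=list)
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class Application:
    name: str
    versions: List[Version] = field(default_factory=list)


# A tiny, made-up instance of the model.
region = Region(Position(1, 1), Position(10, 1))
source = SourceFile("main.f90", "program main", [region])
run = Experiment(datetime(2000, 1, 1), 4, Machine(8),
                 [RegionSummary(region)], [Event(0.5, 0)])
app = Application("demo", [Version(1, [source], [run])])
print(app.versions[0].experiments[0].profile[0].region.start_pos)
```

Paradigm-specific models such as the HPF classes can then be expressed as
subclasses of these base classes.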

3.2 HPF Class Hierarchy 

High Performance Fortran (HPF) [8] defines a set of language extensions to For- 
tran 90/95 to facilitate efficient parallel programming of scalable parallel archi- 
tectures. The main concept of HPF relies on data distribution. The programmer 
writes a sequential program and specifies how the data space of a program should 
be distributed by adding data distribution directives to the declarations of ar- 
rays. A compiler then translates a program into an efficient parallel SPMD target 
program using explicit message passing on distributed memory machines. More 
details about HPF can be found in [8]. Figure 2 shows the classes that model 
static information for HPF programs [8]. Class HPFRegion is a subclass of Re- 
gion (see Figure 1) and contains the following attributes (Figure 3) representing 
static performance-related information: 

— dirs describes HPF directives such as PROCESSORS, DISTRIBUTE, etc.

— deps specifies data dependences implied by code regions. A data dependence 
is described by the source, destination, type (e.g. true-, anti- and output- 
dependence), direction, distance, and level (loop-carried or -independent). 



196 Thomas Fahringer et al. 



— decls specifies HPF data declarations for scalars and arrays. Data
declarations are described by attributes name (name of a data object), data_type
(e.g. integer), rank (dimensionality of data), type (data is an array or a
scalar), alloc (data has been declared DYNAMIC or STATIC), and format
(distributed, replicated, prescriptive, etc.). In case of an array, additional
information about every dimension is provided which includes the size of the
dimension, distribution type (e.g. BLOCK or CYCLIC), block size in case
of BLOCK distribution, and alignment information.

[Figure: UML class diagram of HPFRegion (a subclass of Region) with the
associated classes Dependence, HPFDataDeclaration, and HPFArrayDimension]

Fig. 2. Static performance-related information for HPFRegions

Figure 3 displays several subclasses which extend HPFRegion for the following
code regions: HPFProcedure, HPFLoop, HPFIfBlock, HPFBasicBlock,
HPFProcedureCall, and HPFArrayAssignment. Class HPFLoop can be further
specified by the attribute ltype, which indicates the type of loop, for
instance, DO loop, DO INDEPENDENT loop, etc. A DO INDEPENDENT loop is HPF’s
parallel loop, which asserts that the loop iterations can be executed in any
order. HPF array assignments cover both Fortran?? and Fortran90 array
operations.

Figure 4 shows the HPF class library for dynamic information. Class
HPFRegionSummary extends class RegionSummary (see Figure 1) and comprises three
attributes: processes specifies the set of processes executing a region, sums
reflects performance summary information across all processes executing the
region, and proc_sums indicates performance summary information for a region
with respect to individual processes.

Fig. 3. Subclasses of HPFRegion

Class HPFSummary contains several performance attributes which are average
values across all processes with respect to a specific region:

— nr_executions: number of times the region has been executed

— duration: time spent in executing the region

— comm_time: communication time

— dep_comm_time: communication time caused by data dependences

— align_comm_time: communication time caused by data alignment

— sync_time: synchronization time

— idle_time: idle time

— io_time: input/output time

— compiler_ovh_time: compiler overhead time

— inspector_time: time spent in inspector/executor phase (compiler inserted
code to handle irregular problems)

— expl_redistr_time: time spent in explicit redistribution of data caused by HPF
REDISTRIBUTE directive.

— impl_redistr_time: time spent in implicit redistribution of data at procedure
boundaries.

— nr_cache_misses: number of cache misses.

Compiler overhead time accounts for extra costs implied by compiler-inserted
statements in a parallel program. For instance, in order to enforce the owner- 
computes paradigm (write operations can only be executed by the owning pro- 
cess) additional IF-conditions may be inserted in a parallel program. Executing 
these IF-conditions is considered to be compiler overhead. 

Class HPFProcessSummary contains all attributes of class HPFSummary
with respect to a specific process (identified by attribute process). Note that
dynamic performance-related information can be either measured [9,10,3] or
predicted [6]. Some of this information, however, requires close interaction
between performance measurement/prediction tools and compilers [6].

[Figure: UML class diagram of the summary classes; the recoverable attributes
include process_id : int, nr_executions : int, duration : float,
comm_time : float, dep_comm_time : float, io_time : float, and
nr_cache_misses : int]

Fig. 4. Dynamic performance-related information (summaries) in the HPF data
model

4 Performance Property Specification 

A performance property (e.g. load imbalance, synchronization, cache misses, re- 
dundant computations, etc.) characterizes a specific performance behavior of a 
program. The ASL property specification (Figure 5) syntax defines the name 
of the property, its context via a list of parameters, and the condition, confi- 
dence, and severity expressions. The property specification is based on a set of 
parameters. These parameters specify the property’s context and parameterize 
the expressions. The context specifies the environment in which the property is 
evaluated, e.g. the program region and the test run. The condition specification 
consists of a list of conditions. A condition is a predicate that can be prefixed by 
a condition identifier. The identifiers have to be unique with respect to the prop- 
erty since the confidence and severity specifications may refer to the conditions 
by using the condition identifiers. The confidence specification is an expression 
that computes the maximum of a list of confidence values. Each confidence value 
is computed as an arithmetic expression in an interval between zero and one. The 
expression may be guarded by a condition identifier introduced in the condition 
specification. The condition identifier represents the value of the condition. The
severity specification has the same structure as the confidence specification. It
computes the maximum of the individual severity expressions of the conditions.
The severity specification will typically be based on a parameter specifying the
rank basis. If, for example, a representative test run of the application has been
monitored, the time spent in remote accesses may be compared to the total
execution time. If, instead, a short test run is the basis for performance
evaluation since the application has a cyclic behavior, the remote access
overhead may be compared to the execution time of the shortened loop.
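
The way guarded confidence and severity expressions combine can be sketched as
follows. This Python fragment is an interpretation, not ASL itself: it assumes
that an expression guarded by a condition identifier contributes only when that
condition is true, and that the result is 0 when no expression contributes. The
function name evaluate_property and the sample values are hypothetical.

```python
def evaluate_property(conditions, confidences, severities):
    """Evaluate a property with several named conditions.

    conditions:  dict mapping condition identifier -> bool
    confidences: list of (guard, value); guard is a condition identifier or None
    severities:  list of (guard, value), same structure as confidences
    Returns (holds, confidence, severity).
    """
    holds = any(conditions.values())          # conditions are combined with OR

    def guarded_max(entries):
        # Assumption: an entry counts only if it is unguarded or its guard is true.
        values = [value for guard, value in entries
                  if guard is None or conditions[guard]]
        return max(values, default=0.0)

    return holds, guarded_max(confidences), guarded_max(severities)


result = evaluate_property(
    conditions={"c1": True, "c2": False},
    confidences=[("c1", 1.0), ("c2", 0.8)],
    severities=[("c1", 0.2), ("c2", 0.5)],
)
print(result)
```

Here only the expressions guarded by c1 contribute, so the larger severity
attached to the false condition c2 is ignored.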



5 HPF Performance Properties 

This section demonstrates the ASL constructs for specifying performance prop- 
erties in the context of HPF. Some global definitions are presented first which 
are then used in the definitions of a number of HPF properties. 



property       is  PROPERTY pp-name '(' arg-list ')'
                   [LET def* IN]
                   pp-condition
                   pp-confidence
                   pp-severity
arg            is  type ident
pp-condition   is  CONDITION conditions
conditions     is  condition
               or  condition OR conditions
condition      is  ['(' cond-id ')'] bool-expr
pp-confidence  is  CONFIDENCE MAX '(' confidence-list ')'
               or  CONFIDENCE confidence
confidence     is  ['(' cond-id ')'] arith-expr
pp-severity    is  SEVERITY MAX '(' severity-list ')'
               or  SEVERITY severity
severity       is  ['(' cond-id ')'] arith-expr

Fig. 5. ASL property specification syntax



5.1 Global Definitions 

In most property specifications it is necessary to access the summary data of a 
given region for a given experiment. Therefore, we define the summary function 
that returns the appropriate HPFRegionSummary object. It is based on the set 
operation UNIQUE that arbitrarily selects one element from the set argument 
which has cardinality one due to the design of the data model. 



HPFRegionSummary summary(Region r, Experiment e) =
    UNIQUE({s IN e.profile WITH s.region==r});



The duration function returns the execution time of a region which is given 
by the arithmetic mean across all processes that execute the region. 

float duration(Region r, Experiment e) = summary(r,e).sums.duration;



For all HPF performance properties the severity is computed by relating
some aspect of the execution time to the duration of a given rank_basis region
(for instance, execution time of the entire program) in the experiment.



5.2 Property Specifications

The parallel_costs property determines whether the total parallel cost in the
execution of a parallel program is non-zero. The parallel costs of a region can
be subdivided into four categories: communication, synchronization, compiler
overhead, and input/output time.

A region has this property if cost_sum is greater than 0. If parallel
costs can be measured then the confidence for this condition is one. In case of
predicting parallel costs, the confidence may be less than one. The severity of
this property is the fraction of the time spent for parallel costs compared to the
duration of rank_basis, typically the duration of the main program. Note that
comm_time, sync_time, compiler_ovh_time, io_time, and duration are summary
figures across all processes executing the region.

Property parallel_costs (HPFRegion r, Experiment e, Region rank_basis) {
  LET
    float cost_sum = summary(r, e).sums.comm_time +
                     summary(r, e).sums.sync_time +
                     summary(r, e).sums.compiler_ovh_time +
                     summary(r, e).sums.io_time;
  IN
    CONDITION:  cost_sum > 0;
    CONFIDENCE: 1;
    SEVERITY:   cost_sum / duration(rank_basis, e);
}
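To make the three parts of the specification concrete, here is a minimal Python sketch that evaluates the same condition, confidence, and severity for one region. The dictionary layout of the summary timings and the function name are assumptions made for illustration; only the arithmetic follows the property above.

```python
def parallel_costs(sums, rank_basis_duration):
    """Evaluate the parallel_costs property for one region.

    `sums` is a dict of summary timings for the region (an assumed layout).
    Returns (condition, confidence, severity).
    """
    cost_sum = (sums["comm_time"] + sums["sync_time"]
                + sums["compiler_ovh_time"] + sums["io_time"])
    condition = cost_sum > 0
    confidence = 1.0      # measured data; a predictor might report less
    severity = cost_sum / rank_basis_duration
    return condition, confidence, severity


region = {"comm_time": 2.0, "sync_time": 1.0,
          "compiler_ovh_time": 0.5, "io_time": 0.5}
holds, confidence, severity = parallel_costs(region, rank_basis_duration=20.0)
print(holds, confidence, severity)
```

With these made-up timings, four time units of parallel cost are related to a rank basis of twenty, so one fifth of the rank basis duration is attributed to parallel costs.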



The severity of this property is larger than the severity of the individual
properties for each of the categories. As a result, the parallel_costs property
may be selected as a performance problem according to the predefined severity
threshold while the individual properties, i.e. those based on comm_time,
sync_time, compiler_ovh_time, and io_time, may not be reported as performance
problems.

The communication_costs property determines whether a region implies
communication.



Property communication_costs (HPFRegion r, Experiment e,
                              Region rank_basis) {
  LET
    float comm_time = summary(r, e).sums.comm_time;
  IN
    CONDITION:  comm_time > 0;
    CONFIDENCE: 1;
    SEVERITY:   comm_time / duration(rank_basis, e);
}



The condition and severity of property communication_costs are based on
comm_time, which is the arithmetic mean across all processes executing a
region. The severity is the communication time divided by the execution time
of the rank basis. A performance property for input/output overhead can be
described very similarly to the communication cost property.

Property redist_costs (HPFRegion r, Experiment e, Region rank_basis) {
  LET
    redist1 = summary(r, e).sums.impl_redistr_time;
    redist2 = summary(r, e).sums.expl_redistr_time;
  IN
    CONDITION:
      (Cond1)
        ((typeof(summary(r, e).region) == HPFProcedure)
         AND
         (EXISTS dec IN summary(r, e).region.decls
          SUCH THAT
            dec.format == PRESCRIPTIVE OR dec.format == TRANSCRIPTIVE)
         AND
         (redist1 > 0))
      OR
      (Cond2)
        ((typeof(summary(r, e).region) == HPFRedistribute)
         AND
         (EXISTS dec IN summary(r, e).region.decls
          SUCH THAT dec.alloc == DYNAMIC)
         AND
         (redist2 > 0))
    CONFIDENCE: MAX((Cond1) -> 0.8, (Cond2) -> 1.0);
    SEVERITY:   MAX((Cond1) -> redist1 / duration(rank_basis, e),
                    (Cond2) -> redist2 / duration(rank_basis, e));
}
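The guarded MAX expressions can be read as "take the maximum over the values whose condition holds". The Python sketch below follows that reading, which is an assumption about the ASL semantics rather than a quotation of it; the string constants and argument names are likewise hypothetical stand-ins for the HPF data model.

```python
def guarded_max(*guarded_values):
    """MAX over (condition, value) pairs, ignoring pairs whose condition is false."""
    active = [value for condition, value in guarded_values if condition]
    return max(active) if active else None


def redist_costs(region_type, decl_formats, decl_allocs,
                 impl_redistr_time, expl_redistr_time, rank_basis_duration):
    """Return (condition, confidence, severity) for one region."""
    # Cond1: implicit redistribution at a procedure boundary.
    cond1 = (region_type == "HPFProcedure"
             and any(f in ("PRESCRIPTIVE", "TRANSCRIPTIVE") for f in decl_formats)
             and impl_redistr_time > 0)
    # Cond2: explicit redistribution of DYNAMIC arrays.
    cond2 = (region_type == "HPFRedistribute"
             and any(a == "DYNAMIC" for a in decl_allocs)
             and expl_redistr_time > 0)
    confidence = guarded_max((cond1, 0.8), (cond2, 1.0))
    severity = guarded_max(
        (cond1, impl_redistr_time / rank_basis_duration),
        (cond2, expl_redistr_time / rank_basis_duration),
    )
    return cond1 or cond2, confidence, severity
```

When neither condition holds, this sketch reports no confidence and no severity (None), since the property does not apply to the region.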



Property redist_costs specifies the time spent in redistributing data inside
(redistribution of dummy arrays and dynamic arrays) or outside of procedures
(redistribution of dynamic arrays). We include two different conditions for this
property to hold, each of which is associated with its own confidence value. The
first condition ensures that a region is a procedure. If the procedure has
prescriptive or transcriptive mapping and impl_redistr_time is non-zero then
redistribution may occur at the procedure boundary. The second condition covers
the case where DYNAMIC arrays are redistributed based on the HPF REDISTRIBUTE
directive which causes expl_redistr_time to be non-zero. A close integration
between performance measurement tool and HPF compiler is needed in order to
associate measured performance data with the location in the HPF program that
causes redistribution. Commonly, the more conditions are included for a property
the more options are available to determine whether a property holds and the
more refined information about the cause of the property is supplied. For
instance, if prediction fails, then monitoring could be employed to prove a
certain performance property. For property redist_costs, we include two
conditions with more detailed information than if only a single value for all
redistribution costs were provided.

The confidence of property redist_costs is the maximum confidence value
across both conditions as specified above. Note that for this property we
assume that measuring implicit redistribution (condition 1) may be less precise
than timings of explicit redistribution (condition 2) due to a lack of compiler
information. The severity is the maximum across the time spent in, respectively,
implicit and explicit redistribution of data relative to the execution time of
the region selected as the rank basis.

Property uneven_work_distribution specifies how even the computations of a
parallel program have been distributed across all processes executing a region.

Property uneven_work_distribution (HPFRegion r, Experiment seq,
                                   Experiment par, Region rank_basis) {
  LET
    int nr_processes = COUNT(procs WHERE procs IN summary(r, par).processes);
    float opt_duration = duration(r, seq) / nr_processes;
    float deviation = SQRT(SUM(EXP(proc_sums.duration - proc_sums.comm_time
                                   - proc_sums.sync_time - proc_sums.idle_time
                                   - proc_sums.compiler_ovh_time - proc_sums.io_time
                                   - opt_duration, 2)
                               WHERE proc_sums IN summary(r, par).proc_sums))
  IN
    CONDITION:  (deviation / opt_duration) > uneven_threshold;
    CONFIDENCE: 1;
    SEVERITY:   summary(r, par).sums.duration / duration(rank_basis, par);
}

The standard deviation of the computational costs (excluding communica- 
tion, synchronization, idle time, compiler overhead, and input/output time) of 
every process with respect to the optimal duration (sequential execution time 
divided by number of processes) is computed. The condition is then given as 
the variation coefficient normalized with the optimal duration compared against 
a threshold. The severity is defined as the execution time divided by the rank 
basis. 
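The computation described above can be sketched in Python as follows. The sketch follows the specification as written, taking the square root of the summed squared differences without dividing by the number of processes. The per-process dictionaries, the argument names, and the choice of threshold value are assumptions for illustration only.

```python
import math


def uneven_work_distribution(proc_sums, seq_duration, par_duration,
                             rank_basis_duration, uneven_threshold):
    """Return (condition, confidence, severity) for one region.

    `proc_sums` holds one dict of timings per process (assumed layout).
    """
    nr_processes = len(proc_sums)
    opt_duration = seq_duration / nr_processes

    def computation(p):
        # Pure computation time: duration minus all overhead categories.
        return (p["duration"] - p["comm_time"] - p["sync_time"]
                - p["idle_time"] - p["compiler_ovh_time"] - p["io_time"])

    deviation = math.sqrt(sum((computation(p) - opt_duration) ** 2
                              for p in proc_sums))
    condition = deviation / opt_duration > uneven_threshold
    severity = par_duration / rank_basis_duration
    return condition, 1.0, severity
```

For two processes that compute for 6 and 2 time units when the optimum is 4 each, the normalized deviation is about 0.71, so the property holds for any threshold below that value.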

The data distribution directives of an HPF program have a decisive impact
on the resulting quality of the work distribution. For instance, the nested DO
loop in the following code excerpt implies a triangular loop iteration space.

!HPF$ DISTRIBUTE A(BLOCK,*)
      DO i=1,n
        DO j=i,n
          A(i,j) = A(i,j)-A(i,j-1)*A(i,i)
        END DO
      END DO

Distributing array A column-wise causes a very uneven distribution of the
loop iterations across a set of processes executing this loop. If array A were
distributed CYCLIC (for instance, CYCLIC in the first and replicated in
the second dimension), then the corresponding work distribution can be nearly
optimal depending on how many processes are involved in executing this loop.
A more detailed technical report with many more examples about performance
modeling for HPF applications based on ASL can be found in [5].
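The effect of the two distributions on the triangular loop nest can be checked with a small count of inner-loop iterations per process. The sketch assumes an owner-computes rule on the first array dimension and an HPF-style block size of ceil(n/p); the function name and parameters are hypothetical.

```python
def work_per_process(n, p, scheme):
    """Inner-loop iterations executed by each of p processes.

    Row i of the triangular loop nest performs n - i + 1 iterations and is
    assigned to the process that owns row i of A (owner-computes assumption).
    """
    block = -(-n // p)                     # ceil(n / p): rows per BLOCK
    work = [0] * p
    for i in range(1, n + 1):
        owner = (i - 1) // block if scheme == "BLOCK" else (i - 1) % p
        work[owner] += n - i + 1
    return work


print(work_per_process(8, 4, "BLOCK"))     # [15, 11, 7, 3]
print(work_per_process(8, 4, "CYCLIC"))    # [12, 10, 8, 6]
```

Both distributions execute the same 36 iterations in total, but the BLOCK variant gives the first process five times the work of the last one, while the CYCLIC variant narrows the spread to a factor of two; the spread shrinks further as n grows relative to p.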

6 Conclusions and Future Work 

In this article we describe a novel approach to the formalization of
performance problems and the data required to detect them with the future aim
of supporting automatic performance analysis for a variety of programming
paradigms and architectures. We present the APART Specification Language (ASL)
developed as part of the APART Esprit IV Working Group on Automatic Performance
Analysis: Resources and Tools. This language allows the description of
performance-related data through the provision of an object-oriented
specification model and supports definition of performance properties in a
novel formal notation. Performance-related data can either be static (gathered
at compile-time, e.g. code regions, control and data flow information,
predicted performance data, etc.) or dynamic (gathered at run-time, e.g. timing
events, performance summaries, etc.) and is used as a basis for describing
performance properties. A performance property (e.g. uneven work distribution,
communication, redistribution costs, etc.) characterizes a specific type of
performance behavior which may be present in a program. A set of conditions
defined over performance-related data verifies the existence of properties
during (the execution of) a program. We applied ASL to HPF by successfully
formalizing several HPF performance properties. ASL has also been used to
formalize a large variety of MPI and OpenMP performance properties, which are
described in [4].

Two extensions to the current language design will be investigated in the
future: First, the language will be extended by templates which facilitate
specification of similar performance properties both within and across program
paradigms. Second, meta-properties may be useful as well. For example,
communication can be proven based on summary information, i.e. communication
exists if the sum of the communication time in a region over all processes is
greater than zero. Meta-properties can be useful to evaluate other properties
based on region instances instead of region summaries.

ASL should be the basis of a common interface for a variety of performance
tools that provide performance-related data. Based on this interface we plan to
develop a system that provides automatic performance analysis for a variety of
programming paradigms and target architectures.

Abstract. This paper presents a new processor allocation approach called “a
generalized k-Tree-based model” to perform dynamic sub-system
allocation/deallocation decisions for partitionable multi-dimensional
mesh-connected architectures. The time complexity of our generalized
k-Tree-based sub-system allocation algorithm is
O(k^4 2^k (N_A + N_F) + k^2 2^{2k}) for the partitionable k-D meshes and
O(N_A + N_F) for the partitionable 2-D meshes, where N_A is the maximum number
of allocated tasks, N_F is the corresponding number of free sub-meshes, N is the
system size, and N_A + N_F ≤ N. Most existing processor allocation strategies
have been proposed for the partitionable 2-D meshes with various degrees of time
complexity and system performance. In order to evaluate the system performance,
the generalized k-Tree-based model was developed, and the simulation results of
applying our k-Tree-based approach to the partitionable 2-D meshes were
presented and compared to existing 2-D mesh-based strategies. Our results showed
that the k-Tree-based approach (when applied to the partitionable 2-D meshes)
yielded system performance comparable to that of recent 2-D mesh-based
strategies.



1 Introduction 

A multi-dimensional (k-D) mesh-connected parallel architecture is a useful network type
for parallel systems running high performance parallel applications that may require
different degrees of processor interconnection (such as 1-D text processing, 2-D image
processing, 3-D graphic processing, etc.). The partitionable k-D mesh system is provided
(at run time) for executing various independent applications (or tasks) in parallel.
Examples of prototypes and commercial parallel systems (which support a multi-user
environment) include the Intel Touchstone system [5], the Intel Paragon XP/S [6], the
Intel/Sandia ASCI System [10], etc. In the partitionable parallel systems, a number of
independent smaller tasks (from the same or different applications) come in, each
requiring at run time a separate sub-system (or partition) to execute. In order to provide
appropriate free sub-systems for new tasks, a specially designed operating system (known
as the processor allocator) has to dynamically partition the computer system to allocate
a sub-system for each incoming task, as well as to deallocate a sub-system and recombine
partitions as soon as they become available when a task completes. For the partitionable
2-D mesh-connected systems, a number of processor allocation/deallocation (decision)
techniques have been proposed in the past, such as bit-map approaches [11], [16] and
non-bit-map approaches [1], [2], [3], [4], [7], [8], [9], [12], [13], [14], [15]. Among
those existing 2-D mesh-based strategies, the time complexities of recent sub-system
allocation methods (that yield comparable system performance) are O(N_A^3) for the Busy
List (BL) [3], O(N_A √N) for the Quick Allocation (QA) [15], and O(N_F^2) for the 2-D
meshes and O(N_F^k) for the k-D meshes for the Free Sub-List (FSL) [7], where N is the
system size, N_A is the number of allocated tasks (N_A ≤ N), and N_F is the number of
free sub-systems (N_F ≤ N).

In this paper, we propose the design of “a generalized k-Tree-based model” in order
to perform dynamic sub-system allocation/deallocation decisions for partitionable
multi-dimensional mesh-connected architectures. Our generalized model includes a k-Tree
system state representation and a number of generalized algorithms (such as the network
partitioning algorithm, the sub-system combining algorithm, the best-fit heuristic for
allocation decision, and the sub-system allocation/deallocation decision algorithm). The
time complexity of the generalized k-Tree-based approach for the partitionable k-D meshes
is O(k^4 2^k (N_A + N_F) + k^2 2^{2k}), where N_A is the maximum number of allocated
tasks, N_F is the corresponding number of free sub-meshes, N is the system size, and
N_A + N_F ≤ N. Therefore, the time complexity of our approach for the partitionable 2-D
meshes is only O(N_A + N_F). For the performance evaluation (by simulation study), the
generalized k-Tree-based model was developed and the system performance of our
application for the partitionable 2-D meshes was presented and compared to recent 2-D
mesh-based strategies such as the Busy List (BL) [3], the Free Sub-List (FSL) [7], and
the Quick Allocation (QA) [15]. Our simulation results showed that the k-Tree-based
approach yielded system performance comparable to that of those recent 2-D mesh-based
strategies.

In the next section, we present the generalized k-Tree-based model to perform proces- 
sor allocation/deallocation decision for the partitionable k-D mesh parallel machines as 
well as the corresponding time complexity. Section 3 presents the evaluated system per- 
formance of the generalized k-Tree-based model. Then, the performance results (by 
simulation study) for the 2-D mesh networks are presented and compared to existing 2-D 
mesh-based strategies. Finally, conclusions are discussed in Section 4. 



2 k-Tree-Based Model to Sub-System Allocation for k-D Meshes 

In this paper, “the generalized k-Tree-based model” is proposed to perform sub-system 
allocation/deallocation decision for the partitionable multi-dimensional (k-D) mesh- 
connected architectures. In the generalized k-Tree-based model, we use a data structure, 
called a “k-Tree” to represent the system states of the partitionable k-D mesh-connected 
systems. Our generalized k-Tree-based model includes a k-Tree system states represen- 
tation (Section 2.1) and a number of generalized algorithms for each sub-system alloca- 
tion/deallocation decision such as the network partitioning algorithm (Section 2.2), the 
sub-system combining algorithm (Section 2.3), the best-fit heuristic for allocation decision 
(Section 2.4), and the sub-system allocation/deallocation decision algorithm (Section 2.5). 




A “Generalized k-Tree-Based Model to Sub-system Allocation” 






2.1 k-Dimensional Mesh Systems and k-Tree System State Representation

DEFINITION 1: A k-Dimensional (k-D) Mesh Network (of size N = n_1 x n_2 x ... x n_k) is
defined as a network graph G(V, E) = G_1 x G_2 x ... x G_k [or a product of k linear
arrays G_i(V_i, E_i) of n_i nodes; V_i = {1, 2, ..., n_i} and
E_i = {<j, j+1> | j = 1, 2, ..., n_i - 1}], where
V = {α = (a_1, a_2, ..., a_k) be an address of any PE (processing element) in G |
a_1 ∈ V_1, a_2 ∈ V_2, ..., a_k ∈ V_k} and
E = {<α, β> be a link between any two PEs in G, α = (a_1, a_2, ..., a_k) and
β = (b_1, b_2, ..., b_k) | there exists an i such that <a_i, b_i> ∈ E_i, where
i = 1, 2, ..., k}.



[Figure: a 1-D linear array (N = n, nodes 1 to n), a 2-D mesh (N = n_1 x n_2, corners (1,1) and (n_1,n_2)), and a 3-D mesh (N = n_1 x n_2 x n_3, corners (1,1,1) and (n_1,n_2,n_3)).]



The “k-Tree” data structure is used to represent system states (or store allocation
information) of the partitionable k-D mesh-connected parallel system, where k is the
number of dimensions of the multi-dimensional (k-D) mesh-connected system. This is a
special k-Tree since sub-systems’ sizes after partitioning are not necessarily equal,
unlike a balanced k-Tree with equal sub-system sizes after partitioning. In this paper, a
“system” refers to a given partitionable k-D system, represented as a k-product network
G_1 x G_2 x ... x G_k of size N = n_1 x n_2 x ... x n_k, and a “sub-system” refers to a
smaller k-D system of size N’ = n_1’ x n_2’ x ... x n_k’, where n_i’ ≤ n_i and
i = 1, 2, ..., k, which is provided for incoming task(s).




In our k-Tree-based approach, the number of nodes in the k-Tree is dynamic,
corresponding to the number of tasks allocated. At the start (see Fig. 1.), the k-Tree
consists of only one node (called the root), used to store the system information (i.e.,
a size, a base-address, a status, etc.) of the initial system, which is the k-D network
G_1 x G_2 x ... x G_k (or ∏ G_i) of size N = n_1 x ... x n_k (or ∏ n_i) with a base
address α = (1, 1, ..., 1), where i = 1, 2, ..., k. During execution (run) time when many
jobs (or tasks) are allocated, each leaf node (representing a sub-system) may be
available or free (status = 0) or unavailable or busy (status = 1) and each internal node
is partially available (status = x). In order to allocate an incoming task, each larger
free node in the k-Tree can be dynamically created and partitioned into a number of
children per node, called buddies. Assume that the incoming tasks always request the same
value “k” as provided by the system. Therefore, the number of buddies = 2^k (described
later in Section 2.2).
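One possible in-memory form of such a k-Tree node is sketched below in Python. The class name, field names, and the traversal helper are assumptions made for illustration; the text only prescribes that a node stores a size, a base address, and a status, and that leaves are free or busy while internal nodes are partially available.

```python
from dataclasses import dataclass, field

FREE, BUSY, PARTIAL = "0", "1", "x"        # status codes used in the text


@dataclass
class KTreeNode:
    base: tuple                            # base address, e.g. (1, 1) for k = 2
    size: tuple                            # extent per dimension, e.g. (64, 64)
    status: str = FREE
    children: list = field(default_factory=list)   # buddies; empty for a leaf

    def is_leaf(self):
        return not self.children

    def leaves(self):
        """Yield all leaf nodes (allocated or free sub-systems)."""
        if self.is_leaf():
            yield self
        else:
            for child in self.children:
                yield from child.leaves()
```

A freshly created root such as KTreeNode(base=(1, 1), size=(64, 64)) is a single free leaf; it becomes an internal node with status x once buddies are attached to it.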




[Figure: root node of the k-Tree for systems of size n (k = 1), n_1 x n_2 (k = 2), and n_1 x n_2 x n_3 (k = 3).]

2.2 Network Partitioning Algorithm

The network partitioning is a partitioning process that partitions all k dimensions (or k
linear array networks) of the k-D mesh system (of size N = n_1 x n_2 x ... x n_k) into
smaller sub-systems and allocates one for the request (a k-D mesh of size
p_1 x p_2 x ... x p_k, where p_i ≤ n_i, i = 1, 2, ..., k). Under this network
partitioning, the relationship of the corresponding buddy’s ID, buddy’s base-address, and
buddy’s size is defined as follows:

Let a requested network be defined as a network graph G’ = G_1’ x G_2’ x ... x G_k’ (of
size p_1 x p_2 x ... x p_k, where p_i ≤ n_i, i = 1, 2, ..., k). Then the number of
buddies per node in each partitioning is equal to 2^k (with the assigned IDs = 1, 2, 3,
..., 2^k) since all k dimensions are partitioned into two sizes, which are not
necessarily equal. Next, we introduce the corresponding buddy’s base-address and buddy’s
size, which are defined in the “Buddy-ID-Address-Size conversion” algorithm (see Fig. 2.).
This conversion process is introduced to provide a small number of steps for the network
partitioning algorithm (described next) and the sub-system combining algorithm (described
later in Section 2.3). This process (i.e., identifying #buddies = 2^k, their
base-addresses and sizes) is computed in k2^k steps and hence the total time complexity
is O(k2^k).



Let R be a considering root node (of size N = n_1 x n_2 x ... x n_k);
    J be a requested task or job (of size p_1 x p_2 x ... x p_k);
    α = (a_1, a_2, ..., a_k) be R’s base address and also that of R’s Buddy#1;
    Size of R’s Buddy#1 = p_1 x p_2 x ... x p_k (or the requested task’s size);

Conversion Method is

  1. Convert the set of all buddy-IDs to their corresponding k-bit-IDs
     [ computed in steps ];
  2. Compute each buddy’s base-address = (a_1’, a_2’, ..., a_k’) and
     its size = (n_1’ x n_2’ x ... x n_k’) [ computed in k2^k steps ];
     for j = 1, 2, ..., k,
       a_j’ = a_j;        n_j’ = p_j;        if the j-th bit (or dimension) b_{j-1} = 0,
       a_j’ = a_j + p_j;  n_j’ = n_j - p_j;  otherwise (b_{j-1} = 1), since that dimension
                                             is partitioned, corresponding to the size of Buddy#1.

i.e., for k = 2, we have 4 buddies (1, 2, 3, 4) whose base addresses and sizes are:

  Buddy   Base address              Size
  #1      (a_1, a_2)                p_1 x p_2
  #2      (a_1 + p_1, a_2)          (n_1 - p_1) x p_2
  #3      (a_1, a_2 + p_2)          p_1 x (n_2 - p_2)
  #4      (a_1 + p_1, a_2 + p_2)    (n_1 - p_1) x (n_2 - p_2)

Fig. 2. The “Buddy-ID-Address-Size conversion” algorithm for the network partitioning.
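The conversion can be sketched in a few lines of Python. This is an illustrative reading of the algorithm rather than the authors' implementation: the function name, the tuple-based interface, and the 1-based buddy numbering derived from the k-bit ID are assumptions.

```python
def partition(base, size, request):
    """Split a k-D node into 2**k buddies for a request placed at `base`.

    Returns a list of (buddy_id, bits, base_address, size); bit j-1 of the
    buddy's k-bit ID selects the low (0) or high (1) part of dimension j.
    """
    k = len(size)
    buddies = []
    for ident in range(2 ** k):
        bits = format(ident, f"0{k}b")          # b_{k-1} ... b_1 b_0
        new_base, new_size = [], []
        for j in range(1, k + 1):
            b = int(bits[k - j])                # bit b_{j-1}
            a, n, p = base[j - 1], size[j - 1], request[j - 1]
            new_base.append(a + p * b)          # a_j' = a_j + p_j * b_{j-1}
            new_size.append(n - p if b else p)  # n_j' = n_j - p_j or p_j
        buddies.append((ident + 1, bits, tuple(new_base), tuple(new_size)))
    return buddies


for buddy in partition(base=(1, 1), size=(64, 64), request=(20, 20)):
    print(buddy)
```

The loop visits 2^k buddies and k dimensions per buddy, which matches the k2^k step count stated for the conversion. If a request spans a whole dimension, the sketch produces buddies with a zero extent, which a real allocator would presumably discard.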



For example (see Fig. 3.), consider a 2-D mesh system (of size N = n_1 x n_2 = 64 x 64)
which is stored in the k-Tree’s root with a base address α = (a_1, a_2) = (1, 1), where
k = 2. Suppose the first incoming task requests a sub-system of size 20x20. For this task
(20x20), the root (2-D mesh system of size 64x64) is partitioned into 2^2 = 4 buddies.
Let’s assume that the request will be allocated to the sub-Buddy#1 (at level 2).



By applying the “Buddy-ID-Address-Size conversion” algorithm (in Fig. 2.), the
sub-Buddy#1’s base-address is (a_1, a_2) = (1, 1) and its size is (p_1 x p_2) = (20x20).
First, the set of bit-addresses of each buddy i, {i | i = 1, 2, 3, 4}, is
{(b_1 b_0) | b_i = 0 or 1} = {00, 01, 10, 11}, and hence the corresponding base-addresses
are {(a_1’, a_2’) | a_j’ = a_j + (p_j * b_{j-1})} = {(1,1), (21,1), (1,21), (21,21)} and
the corresponding sizes are {(n_1’ x n_2’)} = {(20x20), (44x20), (20x44), (44x44)}.

Fig. 3. An example of a 2-D mesh of size N = 64x64 with an allocated task of size 20x20.

2.3 Sub-System Combining Algorithm

The sub-system combining is applied during processor allocation or deallocation. First,
the “Combinations of 2^j Adjacent Buddies” algorithm is introduced (see Fig. 4.) in order
to combine 2^j buddies (where j = 1, 2, ..., k-1) into the larger free sub-systems. This
algorithm is computed in O(k2^{k-j}) time for each j and hence O(k2^k) time for all
values of the variable j, where j = 1, 2, 3, ..., k-1 (since
k2^1 + k2^2 + k2^3 + ... + k2^{k-1} = 2k[2^{k-1} - 1]).

Let R be a root node (at level L-1) that has all 2^k buddies at level L.

k(2^{k-j}) combinations are described (to combine 2^j free buddies or the j-th dimension, j = 1, 2, ..., k-1) as follows:

For each j,

  1. Create a set (2^{k-j} elements) of (k-j)-bit binary strings [ computed in (k-j)2^{k-j} steps ];
  2. Append j *’s to each of these 2^{k-j} elements to create a set of initial (adjacent) k-bit ternary strings
     [ computed in k2^{k-j} steps ];
  3. Shift left (k-1) times each ternary string T = (t_{k-1} ... t_1 t_0) of all 2^{k-j} strings, where t_i ∈ {0, 1, *}; then
     k(2^{k-j}) combinations are obtained [ computed in k2^{k-j} steps ]. Note: Each ternary string T is composed of
     (k-j) 0’s or 1’s and j *’s.

Fig. 4. The “Combinations of 2^j adjacent buddies” algorithm for the sub-system combining.

For example, consider a (64x64)-mesh system with 4-buddy partitioning, where k = 2 and 2^k = 4 (see Fig. 3.). In
order to combine one of k dimensions or 2^j buddies (such as j = 1), there are k2^{k-j} (= 2 x 2^{2-1} = 4) possible
combinations, recognized as follows: First, the 4-element set of “compounds of 2^j adjacent buddies” is computed
starting from the set of binary strings {0, 1}. Next, each of these two elements is appended with j (or 1) * to create
the corresponding set of ternary strings {0*, 1*}. Then, each ternary string is shifted left (k-1) times to create a
4-element set of ternary strings {0*, *0, 1*, *1} which represents a set of combinable strings (i.e., a ternary string
1* is interpreted as a set of 2 adjacent buddies {10, 11} (or {Buddy#3, Buddy#4}) for a combined system’s size = 64x44).
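The three steps of the combination algorithm can be reproduced with a short Python sketch. It assumes that "shift left" means a cyclic rotation of the ternary string, which is what the worked example requires for 0* to become *0; the function name is hypothetical.

```python
from itertools import product


def adjacent_combinations(k, j):
    """Ternary strings naming every group of 2**j adjacent buddies (1 <= j < k)."""
    # Steps 1 and 2: all (k-j)-bit binary strings, each followed by j '*'s.
    initial = ["".join(bits) + "*" * j for bits in product("01", repeat=k - j)]
    # Step 3: each string plus its k-1 left rotations.
    return [t[shift:] + t[:shift] for t in initial for shift in range(k)]


print(adjacent_combinations(2, 1))   # ['0*', '*0', '1*', '*1']
```

For k = 3 and j = 1 the same function returns 12 strings, in line with the k2^{k-j} count given in the text.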



Next, we classify the k-Tree-based sub-system combining algorithms into three groups, 
which will be applied later in the allocation and deallocation procedures (Section 2.5): 



ALGORITHM C.1 “Combine All Buddies”: This combining procedure is used to combine
all 2^k free buddies (at level L) into a larger free k-Tree’s node at level L-1. This
combining process is computed in O(2^k) time and it is applied after finishing the
deallocation of any finished task in order to maintain the minimum number of nodes and
the maximum free nodes’ sizes in the k-Tree as much as possible.



ALGORITHM C.2 “Combine Some Buddies” (also called “Buddy-Buddy combining”): This
combining procedure is used to combine a number (i.e., 2, 4, ..., or 2^{k-1}, that is,
2^j for j = 1, 2, ..., k-1) of adjacent free buddies (of the same root sub-tree) at
level L into a larger free sub-system in order to allocate it for an incoming task
(whose requested size is larger than that of each of the 2^j buddies but is less than or
equal to that of the combined sub-system). After applying the algorithm (in Fig. 4.: see
also the following figures for the combining of 2-D meshes), for a ternary string T, the
combined sub-system’s size and its base-address are computed by using the “2^j Combinable
Buddy-ID-Address-Size conversion” algorithm (see Fig. 5.), which can be computed in
(3k2^j + k) steps. Therefore, the time complexity of this combining for each j, in order
to combine all possible k2^{k-j} combined sub-systems (from 2^j adjacent free buddies),
is O(k^2 2^k), since k2^{k-j} (3k2^j + k) = 3k^2 2^k + k^2 2^{k-j}, and hence O(k^3 2^k)
for all values of j, where j = 1, 2, ..., k-1.



[Figure: combining 2-D mesh buddies 1 (00), 2 (01), 3 (10), and 4 (11); the ternary string *0 stands for buddies {1, 3} and *1 for buddies {2, 4}.]

Let R be a considering root node (of size N = n_1 x n_2 x ... x n_k & base address α = (a_1, a_2, ..., a_k));
    T be a ternary string (t_{k-1} ... t_1 t_0) from the algorithm in Fig. 4., i.e., T = 1* or {Buddy#3, Buddy#4} of the 2-D meshes;

Buddy-ID-Address-Size Conversion Method is

  1. Convert the ternary string T = (t_{k-1} ... t_1 t_0) to a set (2^j elements) of combinable binary-IDs (b_{k-1} ... b_1 b_0)
     [ i.e., T = 1* → {10, 11} = {Buddy#3, Buddy#4} ] [ computed in (k2^j + k) steps ] by
     1.1 create 2^j binaries (r_{j-1} ... r_1 r_0) and replace them in the location(s) of the j *’s in T;
     1.2 convert binary-IDs to a set of integer-IDs (1 ≤ ID ≤ 2^k, where 1 = 0...00, 2 = 0...01, 3 = 0...10, ..., 2^k = 1...11);
  2. Compute a base-address & a size of each combined sub-system of 2^j sub-systems [ computed in 2k2^j steps ] by
     2.1 a base address = the base address of the minimum buddy-ID;
     2.2 a combined size = (n_1’ x ... x n_k’), for all i, i = 1, 2, ..., k;
         n_i’ = n_i (of the min buddy-ID) + n_i (of the max buddy-ID) if (b_{i-1} (of the min buddy-ID)
                ≠ b_{i-1} (of the max buddy-ID));
         n_i’ = n_i (of the min buddy-ID) otherwise.

Fig. 5. The “2^j Combinable Buddy-ID-Address-Size conversion” algorithm (Algorithm C.2).
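A small Python sketch of this conversion follows. It is a hedged illustration, not the authors' code: the dictionary that maps k-bit binary IDs to (base address, size) pairs is an assumed input format, and the function names are hypothetical.

```python
from itertools import product


def expand(ternary):
    """Binary buddy IDs covered by a ternary string, e.g. '1*' -> ['10', '11']."""
    options = [("0", "1") if t == "*" else (t,) for t in ternary]
    return ["".join(bits) for bits in product(*options)]


def combine(ternary, buddies):
    """Base address and size of the sub-system described by `ternary`.

    `buddies` maps a k-bit binary ID to (base_address, size), as produced by a
    partitioning step (assumed input format).
    """
    ids = expand(ternary)
    low, high = min(ids), max(ids)              # minimum and maximum buddy-ID
    base, low_size = buddies[low]
    _, high_size = buddies[high]
    k = len(ternary)
    size = []
    for i in range(1, k + 1):
        differs = low[k - i] != high[k - i]     # compare bit b_{i-1}
        size.append(low_size[i - 1] + high_size[i - 1] if differs
                    else low_size[i - 1])
    return base, tuple(size)


buddies = {"00": ((1, 1), (20, 20)), "01": ((21, 1), (44, 20)),
           "10": ((1, 21), (20, 44)), "11": ((21, 21), (44, 44))}
print(combine("1*", buddies))    # ((1, 21), (64, 44))
```

For the 64x64 mesh with a 20x20 request, the string 1* combines Buddy#3 and Buddy#4 into a 64x44 sub-system, which agrees with the combined size quoted in the worked example of the combination algorithm.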



ALGORITHM C.3 “Combine a Buddy and Corresponding Sub-Buddies” (also called
“Buddy-SubBuddy combining”) and “Combine Some SubBuddies” (also called
“SubBuddy-SubBuddy combining”): These combining procedures are used to combine some free
buddies (at level L) and their corresponding sub-buddy nodes (at level L+1) to yield more
sub-system recognition (from any partitioning size) than that obtained from the combining
in Algorithm C.2. We introduce the “Buddy-SubBuddy combining” algorithm (see Fig. 6.) for
recognizing k2^k combined sub-systems of a free buddy at level L and its adjacent free
nodes (or sub-buddies) at level L+1, and also the “SubBuddy-SubBuddy combining” algorithm
(see Fig. 7.) for recognizing k2^{k-1} combined sub-systems of some adjacent free nodes
at level L+1 [see also the following figures for all possible combining for the 2-D
meshes]. Each of these conversion processes can be computed in O(k^2 2^k) time for each
combined node (or string) and O(k^2 2^{2k}) time for all possible 2^k strings.




[Figure: all possible combinings for the 2-D meshes. Panels: combine the 1st dimension for each of the 4 buddies {1, 2, 3, 4}; combine the 2nd dimension for each of the 4 buddies {1, 2, 3, 4}; combine sub-buddies (or the middle of 2 adjacent buddies {(1,3), (2,4), (1,2), (3,4)}).]



Let B = (b_{k-1} ... b_1 b_0) be a free buddy (i.e., 00 or {Buddy#1} of the 2-D meshes);
    T = (t_{k-1} ... t_1 t_0) be a ternary string (one of k dimensions to be combined);

Buddy&SubBuddy-ID-Address-Size Conversion Method is

  1. Compute k combinable buddies C = (c_{k-1} ... c_1 c_0) of B from all possible k dimensions [ computed in k^2 steps ];
     (i.e., for the j-th dim, if j = 1, then T = 0* → C = 01 or {Buddy#2} of the 2-D meshes)
  2. Identify combinable sub-buddies (for each C) as a set of ternary strings S = (s_{k-1} ... s_1 s_0) by replacing (k-j) 0’s/1’s
     in T with *’s and replacing j *’s in T with j b’s [ computed in k^2 steps for k C’s ].
     [ Formally, for all bits i = 0, 1, 2, ..., k-1: c_i = t_i; s_i = * (if t_i = 0/1), or c_i = b_i’ (the complement of b_i);
       s_i = b_i (if t_i = *) ]
     (i.e., S = *0 → {00, 10} or {SubBuddy#1, SubBuddy#3} of the 2-D meshes)
  3. Convert the set S by using the “2^j Combinable Buddy-ID-Address-Size conversion” algorithm (in Fig. 5.), including a
     number of combined sub-systems (n_1’ x n_2’ ... x n_k’) and their base addresses [ computed in (k-j)(3k2^j + k) steps ];
  4. Compute a base-address & size of each combined S from 2 Buddies in the set [ computed in k^2 steps for k C’s ]:
     4.1 a final base address = the base address of the min buddy-ID (of the combined S);
     4.2 a final combined size = (n_1’’ x n_2’’ ... x n_k’’), for all i, i = 1, 2, ..., k;
         n_i’’ = n_i’ (of the min buddy-ID) + n_i’ (of the max buddy-ID) if (b_{i-1} (of the min buddy-ID)
                 ≠ b_{i-1} (of the max buddy-ID));
         n_i’’ = n_i’ (of the min buddy-ID) otherwise.

Fig. 6. The “Buddy&SubBuddy-ID-Address-Size conversion” algorithm (Algorithm C.3-1).

Let T = (t_{k-1} ... t_1 t_0) be a ternary string of a pair of combinable 2 buddies

SubBuddy&SubBuddy-ID-Address-Size Conversion Method is (in this case j = 1)

  1. Convert T to a set (2 elements) of buddy-IDs {B_1, B_2 | B_i = (b_{i,k-1} ... b_{i,1} b_{i,0})} of 2 partially free buddies at
     level L by creating 2^j binary numbers (r_{j-1} ... r_1 r_0) and replacing them into the j *’s of (t_{k-1} ... t_1 t_0)
     [ computed in k2 steps ];
  2. Identify combinable sub-buddies of each of the 2 buddies as a set of ternary strings S = (s_{k-1} ... s_1 s_0) [ computed
     in 12 steps ]; for all bits i = 0, 1, 2, ..., k-1: b1_i = t_i; b2_i = t_i; and s1_i = *; s2_i = * if t_i = 0 or 1,
     or b1_i = 0; b2_i = 1 and s1_i = b1_i’; s2_i = b2_i’ if t_i = *;
  3. Similar to that of the algorithm in Fig. 6.
  4. Similar to that of the algorithm in Fig. 6.

Fig. 7. The “SubBuddy&SubBuddy-ID-Address-Size conversion” algorithm (Algorithm C.3-2).



2.4 Best-Fit Heuristic for Allocation Decision 

In this sub-section, we present a generalized best-fit heuristic for the partitionable k-D 
mesh-connected systems. The best-fit heuristic is to find the best free sub-system (of all 
possible available sub-systems) for an incoming task by introducing a number of criteria 
that tend to cause the minimum system fragmentation(s), which are: 

Best-Fit Criteria:

1. Find all free sub-systems that can preserve the “maximum free size” as far as possible [ see Algorithm A.1 in
   O(k) time ].

2. If there are many candidates (sizes ≥ the request) that have the same property in (1), then the candidate
   that gives the “minimum different size factor (diffSF)” is selected [ see Algorithm A.2 in O(l2) time ].

3. If there are many candidates (sizes ≥ the request) that have the same property in (1) & (2), then the
   “smallest size” candidate that yields the “minimum combining factor (CF)” [ see Algorithm A.3 in O(k)
   time ] is selected. Otherwise, select at random.

4. After searching on all nodes in the k-Tree,
   - If the best free sub-system is “equal to” the request, then it is directly allocated to the request.
   - Otherwise (it is “larger than” the request), it is partitioned and one of its buddies which yields
     properties similar to those given in Step 1 - Step 3 plus being the “best buddy location” or providing the
     “minimum modified CF (MCF)” will be selected [ see Algorithm A.4 in O(k2) time ].

Note: Criteria 1-3 are applied for every free node or combined sub-system; however, criterion 4 is computed
only once for the best free sub-system, obtained from Steps 1-3.



For example, consider a 64x64 system with two tasks (20x20, 22x15) allocated (see Fig. 8).

For the new task (22x5), the searching process starts from the root and then visits all nodes in the k-Tree (in order to
find the best free S). After applying Steps 1-3 of the best-fit heuristic, the best free node that can accommodate the
request is S_1(22, 5) (with min diffSF, min CF) at level 3 (see Fig. 8.a). Therefore, it is allocated to the requested
task 22x5. For the next task (15x10) (see Fig. 8.b), after applying the best-fit criteria (Steps 1-3), the best
sub-system is S(22, 15) at level 3 since it can preserve the maximum free size (64x44). Since the sub-system S(22, 15)
is larger than the request 15x10, Step 4 is applied, that is, S(22, 15) is partitioned. After partitioning, the rotated
size 10x15 (S_2) provides a better best-fit value (diffSF = 1, free size = 180) than that of the regular size 15x10
(S_1) (diffSF = 2, free size = 110). Finally, among S_2 (the top buddy) and S_2’ (the bottom buddy), S_2’(10,15) is
selected and allocated to the requested task (15x10) since it yields the minimum modified CF (MCF) (see Algorithm A.4).

Fig. 8. An example of the “best fit” k-Tree-based allocation. 

ALGORITHM A.1 : “Overlap Status” of two available sub-systems $S_1$ (the maximum free size) and $S_2$ (any free sub-system) is identified as follows: If these two sub-systems ($S_i$, i = 1, 2) in the k-Tree have base-address $\alpha_i = (a_{i1}, a_{i2}, \ldots, a_{ik})$, last-cover address $\beta_i = (b_{i1}, b_{i2}, \ldots, b_{ik})$, and size $|S_i| = (n_{i1} \times n_{i2} \times \cdots \times n_{ik})$, where $b_{ij} = a_{ij} + n_{ij} - 1$, i = 1, 2, and j = 1, 2, ..., k, then Case 1: the sub-system $S_2$ is a subset of the sub-system $S_1$ if ($a_{2j} \ge a_{1j}$ and $b_{2j} \le b_{1j}$) for all j, j = 1, 2, ..., k. Case 2: the sub-systems $S_1$ and $S_2$ are disjoint either if ($a_{1j} - a_{2j} \ge n_{2j}$) or if ($a_{2j} - a_{1j} \ge n_{1j}$) for some j, j = 1, 2, ..., k. Otherwise (neither Case 1 nor Case 2) the sub-system $S_2$ intersects the sub-system $S_1$. Each of these three statuses (subset, disjoint, intersect) can be computed in O(k) time.



For instance, the following figure illustrates various overlap statuses between $S_1$, “the maximum free size” [ $n_1$ = 6 x 6 at <(1, 5), (6, 10)> ], and other free sub-systems such as $S_2$ [ $n_2$ = 4 x 4 at <(5, 1), (8, 4)> ], $S'_2$ [ $n'_2$ = 2 x 4 at <(5, 7), (6, 10)> ], and $S''_2$ [ $n''_2$ = 4 x 2 at <(5, 5), (8, 6)> ].

[Figure: $S_1$ (6x6) together with $S_2$ (4x4), $S'_2$ (2x4), and $S''_2$ (4x2).]

$S_1$ and $S_2$ are disjoint since for i = 2 there exists ($a_{12} - a_{22}$) = (5 - 1 = 4) ≥ $n_{22}$ (= 4). $S'_2$ is a subset of $S_1$ since $S'_2$ = <(5, 7), (6, 10)> ⊂ $S_1$ = <(1, 5), (6, 10)> (or for all i = 1, 2: $a_{2i} \ge a_{1i}$ and $b_{2i} \le b_{1i}$). $S''_2$ intersects $S_1$ since they are not disjoint and neither is a subset of the other.
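
The three cases of Algorithm A.1 can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the function name `overlap_status` and the (base address, size) argument layout are assumptions made here, and the comparisons are read as ≥ and ≤, which is what the worked example above requires.

```python
from typing import Sequence


def overlap_status(a1: Sequence[int], n1: Sequence[int],
                   a2: Sequence[int], n2: Sequence[int]) -> str:
    """Classify S2 relative to S1 as 'subset', 'disjoint' or 'intersect'.

    a1, a2 are base addresses and n1, n2 are sizes, one entry per dimension.
    """
    k = len(a1)
    # Last-cover address in each dimension: b_j = a_j + n_j - 1.
    b1 = [a1[j] + n1[j] - 1 for j in range(k)]
    b2 = [a2[j] + n2[j] - 1 for j in range(k)]
    # Case 1: S2 lies inside S1 in every dimension.
    if all(a2[j] >= a1[j] and b2[j] <= b1[j] for j in range(k)):
        return "subset"
    # Case 2: at least one dimension separates the two sub-systems.
    if any(a1[j] - a2[j] >= n2[j] or a2[j] - a1[j] >= n1[j] for j in range(k)):
        return "disjoint"
    # Neither case holds, so the sub-systems partially overlap.
    return "intersect"


# The sub-systems of the example: S1 is 6x6 at base address (1, 5).
S1 = ((1, 5), (6, 6))
for name, base, size in [("S2", (5, 1), (4, 4)),
                         ("S'2", (5, 7), (2, 4)),
                         ("S''2", (5, 5), (4, 2))]:
    print(name, overlap_status(S1[0], S1[1], base, size))
```

Each test touches every dimension once, which matches the O(k) cost stated for the algorithm.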




ALGORITHM A.2 : “Task Rotation” in this paper is a process of shifting a given task size k-1 times to find a suitable location in the free sub-system for the requested (k-D) task of size $p_1 \times p_2 \times \cdots \times p_k$. Thus, there are k possible rotated sizes that can be allocated for the task: ($p_1 \times p_2 \times \cdots \times p_k$), ($p_2 \times p_3 \times \cdots \times p_k \times p_1$), ..., and ($p_k \times p_1 \times \cdots \times p_{k-1}$). For the partitioning of each rotated task size against a given free S, the different size factor (0 ≤ diffSF ≤ k) and the maximum (remaining) free size (FS) (0 ≤ |FS after partition| < |FS before partition|) are computed in O(k) time and hence in $O(k^2)$ time for all k possible rotated sizes. See the example of task rotation in Fig. 8.b, in which the rotated S2(10x15) is selected rather than the regular S1(15x10).
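
A small Python sketch of task rotation follows. Two points are assumptions made for illustration rather than statements from the paper: diffSF is taken to be the number of dimensions in which the task size differs from the free sub-system, and the remaining free size is taken to be the largest slab left after cutting the task off along one dimension. Both readings reproduce the values quoted for Fig. 8.b (diffSF = 1, free size = 180 for 10x15 and diffSF = 2, free size = 110 for 15x10 inside S(22, 15)). The function names are hypothetical.

```python
from math import prod


def rotations(task):
    """Return the k cyclic shifts of a k-D task size."""
    return [tuple(task[i:] + task[:i]) for i in range(len(task))]


def evaluate(free, task):
    """Return (diffSF, max remaining free size), or None if the task does not fit."""
    if any(p > n for p, n in zip(task, free)):
        return None
    # Assumed definition: number of dimensions whose sizes differ.
    diff_sf = sum(p != n for p, n in zip(task, free))
    # Assumed definition: largest slab left when cutting along one dimension.
    remaining = max((free[j] - task[j]) * prod(free[:j] + free[j + 1:])
                    for j in range(len(free)))
    return diff_sf, remaining


def best_rotation(free, task):
    """Pick the rotation with the largest remaining free size, then the smallest diffSF."""
    scored = [(r, evaluate(free, r)) for r in rotations(task)]
    scored = [(r, v) for r, v in scored if v is not None]
    return min(scored, key=lambda c: (-c[1][1], c[1][0]), default=None)


print(best_rotation((22, 15), (15, 10)))
```

Each rotation is scored in O(k) operations, so scoring all k rotations costs $O(k^2)$, in line with the bound given above.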



ALGORITHM A.3 : “Combining Factor (CF)” of any free sub-system S (at level L) is computed from its adjacent neighbor nodes as a summation of the probability of combining (PC) of each of the 2k combinable nodes of the same root sub-tree.

In this study, for an adjacent side we define PC = 0 if that particular side is one of the 2k system boundaries (since that side cannot be combined); PC = 1/4 if the adjacent node on that side is busy (since it can be combined after it becomes free); PC = 1/2 if the adjacent node is partially available (and some free sub-buddies may be combined); or PC = 1 if the adjacent node is free (so it can be combined immediately). Then the combining factor of α is CF(α) = CF1(α, β) + CF2(α, γ), where CF1(α, β) = PC(α, β1) + PC(α, β2) + ... + PC(α, βk) is the combining factor of α at level L-1 and CF2(α, γ) = PC(α, γ1) + PC(α, γ2) + ... + PC(α, γk) is the combining factor of α at level L-2:



Let α denote the binary-ID ($b_{k-1} \ldots b_1 b_0$) of the node under consideration at level L;

k is the number of combinable buddies of the root sub-tree (of α) at level L-1 or L-2;

β1, β2, ..., βk denote the binary-IDs of the combinable node(s) of the node α with the same root sub-tree at level L-1;

γ1, γ2, ..., γk denote the binary-IDs of the adjacent node(s) of the node α with the same root sub-tree at level L-2.

Identify adjacent nodes by using the following rules:

1) For a k-Tree node, each βi or γi is identified by negating the i-th bit of that node (α), which gives β1 = ($b_{k-1} \ldots b_1 b_0'$), β2 = ($b_{k-1} \ldots b_1' b_0$), ..., and βk = ($b_{k-1}' \ldots b_1 b_0$). Therefore, for root(α) = ($r_{k-1} \ldots r_1 r_0$), its combinable buddies are γ1 = ($r_{k-1} \ldots r_1 r_0'$), γ2 = ($r_{k-1} \ldots r_1' r_0$), ..., and γk = ($r_{k-1}' \ldots r_1 r_0$), respectively (see the following example).

2) For a combined sub-system S = ($t_{k-1} \ldots t_1 t_0$) of $2^j$ free (or partially free) buddies, 1 ≤ j < k (or $2^j$ combined nodes), βi is obtained by negating the i-th 0/1 (or non-*) bit of S, represented as a binary number, where i = 1, 2, ..., j.

Example: Given an 8x8 mesh, let α = $b_1 b_0$ = 11 (or 4), residing at level 3; the 2 adjacent buddies of α are β1 = 10 (or 3) and β2 = 01 (or 2). Assume the root of node α at level 2 is 10 (or 3); then the 2 adjacent nodes of root(α) are γ1 = 11 (or 4) and γ2 = 00 (or 1).
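
Rule 1 (negating one bit at a time) maps directly onto an XOR with a single-bit mask. The sketch below is an illustration under that reading; the function name `combinable_ids` is hypothetical, and IDs are handled as plain integers, so the "(or 4)" style labels used in the example correspond to the integer value plus one.

```python
def combinable_ids(node_id: int, k: int) -> list[int]:
    """Return the k IDs obtained by negating bit 0, bit 1, ..., bit k-1 in turn."""
    return [node_id ^ (1 << i) for i in range(k)]


# The 8x8-mesh example: alpha = 11 at level 3, root(alpha) = 10 at level 2.
alpha, root = 0b11, 0b10
print([format(b, "02b") for b in combinable_ids(alpha, 2)])  # beta_1, beta_2
print([format(g, "02b") for g in combinable_ids(root, 2)])   # gamma_1, gamma_2
```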

ALGORITHM A.4 : “The Best Buddy (or Best Sub-partition)” is applied after partitioning (for Step 4 in the best-fit heuristic). In this case, assume the best free sub-system from Steps 1-3 is the node S, whose size is larger than the requested task. Then the node S will be partitioned into $2^k$ buddies, and the best sub-partition (one of the $2^k$ buddies) will be allocated to the request. Let $\beta_i = (\beta_{i1}, \beta_{i2}, \ldots, \beta_{ik})$ be the set of combinable buddies of the node under consideration ($\alpha_i$), where i = 1, 2, ..., $2^k$. Since the combining factor (see Algorithm A.3) CF(α, β) is the same for all $\alpha_i$, i = 1, 2, ..., $2^k$, the modified combining factor MCF($\alpha_i$, $\beta_i$) for each $\alpha_i$ is computed as MCF($\alpha_i$, $\beta_i$) = $\sum_j$ PC($\alpha_i$, $\beta_{ij}$) in $O(k 2^k)$ time, and the one yielding the minimum MCF is selected as the best buddy.

See the example of finding the best buddy node in Fig. 8.b: the MCF of S2(10x15) = 1 + 1 + 1/4 + 0 = 2 1/4 (since its four combinable boundaries are two free nodes, one busy node, and one system boundary) and the MCF of S2’ = 1 + 1 + 0 + 0 = 2 (since its four combinable boundaries are two free nodes and two system boundaries); hence S2’ yields the minimum MCF and is selected.
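
The MCF comparison in this example can be reproduced with a few lines of Python. The sketch assumes PC values of 0, 1/4, 1/2 and 1 for a system boundary, a busy neighbour, a partially available neighbour and a free neighbour respectively, and assumes that the MCF is the sum of the PC values over the sides of a buddy; the dictionary keys and the names `PC` and `mcf` are illustrative only.

```python
from fractions import Fraction

# Assumed probability-of-combining values for the four kinds of adjacent side.
PC = {
    "boundary": Fraction(0),
    "busy": Fraction(1, 4),
    "partial": Fraction(1, 2),
    "free": Fraction(1),
}


def mcf(sides):
    """Modified combining factor: sum of PC over the sides of one buddy."""
    return sum(PC[s] for s in sides)


# The two buddies of Fig. 8.b, described by the status of their four sides.
buddies = {
    "S2": ["free", "free", "busy", "boundary"],
    "S2'": ["free", "free", "boundary", "boundary"],
}
best = min(buddies, key=lambda name: mcf(buddies[name]))
for name, sides in buddies.items():
    print(name, mcf(sides))
print("best buddy:", best)
```

Exact fractions are used so that ties between buddies are detected without floating-point noise.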



2.5 Allocation/Deallocation Decision Algorithm 



In “the best-fit k-Tree-based processor allocation” procedure, the search starts from the k-Tree’s root and goes to the leftmost (leaf) node. If that node is free and its size can accommodate the request, then its best-fit value (see Section 2.4) is computed. The best sub-system is then updated if the new free sub-system yields a better best-fit value (since it tends to cause the minimum system fragmentation). This process is repeated for the next node in the k-Tree (if one exists). After all nodes have been visited (including each leaf node and each internal node, the latter for a number of combined sub-systems: see Section 2.3), the final step is applied, which is either 1) to allocate the best sub-system directly to the request (if its size is equal to that of the request) or 2) to partition the corresponding node (see Section 2.2) for the request (if its size is larger than that of the request).
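
The search loop described above, a depth-first traversal that keeps the best candidate seen so far, can be sketched as follows. This is a deliberately simplified illustration: the `Node` class is hypothetical, combined sub-systems at internal nodes are not considered, and "smallest area that fits" is used as a stand-in for the best-fit value of Section 2.4, which in the paper is based on the maximum free size, diffSF and CF.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    size: tuple                 # (rows, cols) of the sub-mesh held by this node
    free: bool = False          # True only for free leaf nodes
    children: list = field(default_factory=list)


def fits(size, request):
    """True if the request fits inside the sub-mesh in every dimension."""
    return all(s >= r for s, r in zip(size, request))


def area(size):
    return size[0] * size[1]


def find_best(node, request, best=None):
    """Depth-first search returning the best free node for the request, or None."""
    if node.free and fits(node.size, request):
        # Stand-in criterion (an assumption): prefer the smallest area that fits.
        if best is None or area(node.size) < area(best.size):
            best = node
    for child in node.children:
        best = find_best(child, request, best)
    return best


root = Node((8, 8), children=[
    Node((4, 4), free=True),
    Node((4, 4)),  # busy leaf
    Node((4, 4), children=[Node((2, 2), free=True), Node((2, 2)),
                           Node((2, 2)), Node((2, 2))]),
    Node((4, 4), free=True),
])
print(find_best(root, (2, 2)).size)
```

Once the search returns, the caller would either allocate the node directly (equal size) or partition it (larger size), as in the two final cases above.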

Finally, whenever a task finishes, “the k-Tree-based processor deallocation” procedure is applied: the search for the location of the finished sub-system starts from the k-Tree’s root and follows the subset path until it reaches the leaf node that stores the information of the finished task. After the k-Tree node corresponding to the finished task is found, its status is updated (or it is removed from the k-Tree). Finally, the combining process is applied recursively from the finished node(s) to the root (where possible).

Note: the expand-node-size function (in the allocation process) is applied if a combined sub-system is selected as the best sub-system for the current incoming task, in order to limit the number of nodes in the k-Tree and to allow the same update and partition methodology as for a regular (free) leaf node. The other corresponding nodes in the combined sub-system then have to be updated as busy. Finally (in the deallocation process), these corresponding busy nodes become free whenever that expanded node is freed (in the resume-node-size function).



For example, assume that in the current system there are three tasks allocated: Buddy#1 and Buddy#2 at level 2 and Buddy#1 at level 3, and suppose that a new incoming task requests a sub-system of size 5x6 (which is larger than each buddy node but smaller than a combined sub-system of size 6x6). Therefore, the expand-node-size processing of the combined (6x6) sub-system (stored in Buddy#4 (at level 2) of the root) is applied before partitioning. Note: the old information of this node is also stored (in the dashed node) so that it can be resumed later, when the task finishes.









Jeeraporn Srisawat and Nikitas A. Alexandridis 



2.6 Time Complexity Analysis 

Let N be the system size ($N = n_1 \times n_2 \times \cdots \times n_k$), $N_A$ be the maximum number of allocated tasks ($N_A \le N$), $N_F$ be the corresponding number of free nodes in the k-Tree ($N_A + N_F \le N$), and M be the maximum number of nodes in the k-Tree (where M = external (leaf) nodes + internal (non-leaf) nodes < 2N).

THEOREM 1 : The time complexity of the k-Tree-based allocation to find the best free sub-system for each incoming task on a k-D mesh (of size $N = n_1 \times n_2 \times \cdots \times n_k$) is $O(k^4 2^k (N_A + N_F) + k^2 2^{2k})$.

PROOF: In the allocation algorithm (see Section 2.5), the number of recursive iterations of the DFS (depth-first search) is at most the number of nodes in the k-Tree, and only nodes whose sizes are larger than (or equal to) the request are visited. In this (non-bit-map) approach, the number of nodes in the k-Tree is proportional to the number of allocated tasks or busy nodes ($N_A$) and the number of free nodes ($N_F$) in the k-Tree, where $N_A + N_F \le N$ and $N_F \le (2^k - 1) N_A$. Since the number of external (or leaf) nodes in the k-Tree is at most $N_A + N_F \le N$ and the number of internal nodes is at most (#leaf nodes - 1) divided by $(2^k - 1)$, the total number of nodes in the k-Tree is at most M, where $M = (N_A + N_F) + (N_A + N_F - 1)/(2^k - 1)$. For each (free) leaf node (of $N_F$ nodes), the best-fit value is computed in $O(k^2)$ time and hence $O(k^2 N_F)$ for all leaf nodes. For each internal node (of $(N_A + N_F - 1)/(2^k - 1)$ nodes), the best-fit value is computed in $O(k^2)$ time for each node and hence $O(k^4 2^{2k})$ time for $k^2 2^k + k^2 2^{2k}$ combined sub-systems, as summarized in Table 1. (Note that the maximum free size (in Step 0) is computed only once for each task by using DFS in $O(k^2 2^k (N_A + N_F))$ time.) Finally, after finding the best free sub-system (Step 3), if its size is equal to the request, then it is directly allocated to the request. Otherwise, the network partitioning and the best sub-partition are applied, which can be computed in $O(k 2^k)$ time. Then the corresponding node(s) in the k-Tree are updated in O(k) time. Note: for a combined sub-system that is larger than the request, the expand-node-size process is applied before partitioning in $O(k^2 2^{2k})$ time. Thus, the total time complexity to visit all nodes in the k-Tree is approximately $[N_A + kN_F + k^2 2^{2k}(N_A + N_F - 1)/(2^k - 1)] + [N_A + k^2 N_F] + [(k^4 2^{2k} + k^2 2^k)(N_A + N_F - 1)/(2^k - 1)] + [k^2 2^{2k} + 2k 2^k + k] = O(k^4 2^k (N_A + N_F) + k^2 2^{2k})$, where $N_A + N_F \le N$.
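
The node-count bound used in the proof is easy to check numerically. The sketch below simply evaluates $M = L + (L - 1)/(2^k - 1)$ for a number of leaves $L = N_A + N_F$; the function name is hypothetical, and the loop confirming $M < 2L$ is a sanity check added here, not a result quoted from the paper.

```python
def max_tree_nodes(leaves: int, k: int) -> float:
    """Bound on the k-Tree node count: leaves plus (leaves - 1)/(2^k - 1) internal nodes."""
    internal = (leaves - 1) / (2 ** k - 1)
    return leaves + internal


# A quadtree (k = 2) with 4 leaves has one internal node; with 7 leaves it has two.
print(max_tree_nodes(4, 2), max_tree_nodes(7, 2))

# The bound never reaches twice the number of leaves, whatever k is.
assert all(max_tree_nodes(leaves, k) < 2 * leaves
           for k in range(1, 6) for leaves in range(1, 200))
```

Because the number of internal nodes shrinks as $2^k$ grows, the leaf count dominates M for larger k.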



Table 1. Time complexity of the k-Tree-based approach for the k-D meshes.

Functions in each k-Tree node for sub-system allocation (see Section 2.5) | Time complexity

(0) Before searching to find the best free sub-system | $O(k^2 2^k (N_A + N_F))$
    - Compute a maximum free size (from all M nodes in the k-Tree): busy leaf + free leaf + internal node operations [ ≤ $N_A + kN_F + k^2 2^{2k}(N_A + N_F - 1)/(2^k - 1)$ ]

(1) Leaf node operation (for $N_F$ nodes) [ ≤ $N_A + k^2 N_F$ ] | $O(k^2 N_F)$
    - Compute best-fit value: $O(k^2)$

(2) Internal node operation (for $(N_A + N_F - 1)/(2^k - 1)$ nodes) [ ≤ $(k^4 2^{2k} + k^2 2^k)(N_A + N_F - 1)/(2^k - 1)$ ] | $O(k^4 2^k (N_A + N_F))$
    - Sub-system combining: Algo. C.2 $O(k^2 2^k)$ and Algo. C.3 (next level) $O(k^2 2^{2k})$
    - Compute best-fit value (of all combined sub-systems): $O(k^4 2^{2k})$

(3) Partitioning after finding the best free node [ ≤ $k^2 2^{2k} + 2k 2^k + k$ ] | $O(k^2 2^{2k})$
    - Expanding node (for a combined sub-system): $O(k^2 2^{2k})$
    - Best sub-partition: $O(k 2^k)$
    - Network partitioning: $O(k 2^k)$
    - Allocate (update k-Tree): $O(k)$

Total time (for M nodes) = (0) + (1) + (2) + (3) | $O(k^4 2^k (N_A + N_F) + k^2 2^{2k})$

Note: Our k-Tree-based model, when applied to the 2-D/3-D meshes, provides a linear time complexity $O(N_A + N_F)$.



THEOREM 2 : The time complexity of the k-Tree-based deallocation, to free the particular k-Tree node that stores the finished task and to combine the free buddy nodes of the root sub-tree up to the root of the k-Tree on the partitionable k-D mesh ($N = n_1 \times n_2 \times \cdots \times n_k$), is $O(n 2^k + k^2 2^{2k})$, where $n = \max(n_1, n_2, \ldots, n_k)$.

PROOF: Let n be the maximum depth of the k-Tree ($n = \max(n_1, n_2, \ldots, n_k)$). Searching for the location of a finished sub-system from the root takes at most $n(2^k)$ steps. Then, combining all $2^k$ buddy nodes from the finished sub-system to the root (where possible) takes another $O(2^k)$ steps. Finally, the resume-node-size process of the expand-node-size process (if any) may be required in $O(k^2 2^{2k})$ time. Therefore, the total time complexity of the k-Tree-based deallocation is $O(n 2^k + k^2 2^{2k}) < M$.

3 Performance of the k-Tree-Based Approach on the 2-D Meshes

In order to evaluate the system performance, the generalized k-Tree-based approach was developed. In a simulation study, a number of experiments were performed to investigate the effect of applying the “k-Tree-based model” to processor allocation/deallocation for the partitionable 2-D meshes and to compare it with recent 2-D mesh-based strategies ([3], [15], [7]). The investigated system performance includes system utilization, system fragmentation, average allocation time, etc. Each experiment was run for around 5,000-50,000 simulation time units, with approximately 1,000-10,000 incoming tasks generated, according to the setting of the system size parameter, the task size (i.e., row, column) parameter and the task size distribution. For each evaluated result, a number of different data sets were generated and the algorithm was repeated until the average system performance no longer changed (or at least 100 iterations). Experimental results of applying the k-Tree-based strategy are presented for both static system performance (which concerns processor allocation for incoming tasks (or jobs) only; that is, it is assumed that no task finishes during the period considered) and dynamic system performance (which also takes into account the deallocation of finished tasks). In this study, in order to give all strategies the same incoming tasks and environment for comparison purposes, the static system performance is considered when we measure the system utilization and system fragmentation; otherwise the dynamic system performance is considered. In each experiment, two task-size distributions are considered: the Uniform distribution U(α, β) and the Normal distribution N(μ, σ). For each of these distributions, the system sizes (N = RxC) are varied and the task sizes [1x1 - RxC] are generated, where α = 1, β = max(R, C) for the Uniform distribution U(α, β) and μ = σ = max(R, C)/2 for the Normal distribution N(μ, σ). Other parameters are fixed, such as the task arrival rate ~ Poisson(λ) (or inter-arrival time ~ Exp(1/λ = 5)) and the service time ~ Exp(μ = 10).

In Experiment 1, we investigated “the effect of system sizes on the system utilization (U) and the system fragmentation (F)”. In this experiment, the system sizes (N = RxC) were varied and the task sizes (1x1 - RxC) were generated and fixed. In Table 2 (the system utilization result), for all test cases the k-Tree strategy achieved ~60% system utilization, which was comparable to that of the recent 2-D mesh-based strategies (i.e., for the uniform distribution, the FSL, the BL, and the QA strategies yielded ~56%, ~57%, and ~56% system utilization, etc.). For the system fragmentation, the k-Tree approach and these existing strategies also produced matching results, since there was no internal system fragmentation (F = 1 - U).

In Experiment 2, we investigated “the effect of task sizes on the system utilization and the system fragmentation”. In this case, the system size was fixed (N = 512x512) and the task sizes were generated and varied. In Fig. 9 (the system fragmentation result), for all test cases the k-Tree strategy produced system fragmentation comparable to the FSL, BL, and QA strategies, which was ~30%, ~40%, and ~41% for task sizes [1x1-250x250], [1x1-350x350], and [1x1-512x512], respectively. For the system utilization, the k-Tree approach and these existing 2-D mesh-based strategies also produced matching results since U = 1 - F (i.e., there is no effect of internal system fragmentation).

In Experiment 3, we investigated “the effect of allocation time” of the k-Tree and existing 2-D mesh-based strategies when the system sizes were increased. In Fig. 10 (the average allocation time), our k-Tree approach yielded an improved average allocation time compared to the existing strategies for all tested cases, except when the system size was small (N = 64x64). In this case, the average allocation time of the k-Tree, FSL, and BL strategies was approximately constant, since it depended on the number of allocated tasks ($N_A$). However, the average allocation time of the QA strategy increased linearly, since it depended on the number of allocated tasks ($N_A$) and the system size (N).

In Experiment 4, we investigated “the effect of allocation and deallocation time” when the system sizes were increased. In Fig. 11 (the average allocation and deallocation time), our k-Tree approach yielded an improved average allocation and deallocation time compared to the existing strategies when the system sizes were increased. In this case, the average allocation and deallocation time of the k-Tree was approximately constant, while that of the existing 2-D mesh-based strategies increased linearly.



Table 2. Effect of “the system sizes” to the system utilization (%).

Task size distribution     System size (N = RxC)   k-Tree   Free Sub-List (FSL)   Busy List (BL)   Quick Allocation (QA)
Uniform (α, β),            64 x 64                 60.97    56.06                 57.27            55.64
α = 1, β = min(R,C),       128 x 128               60.11    56.98                 56.67            57.84
for [1x1 - RxC]            256 x 256               59.77    55.77                 54.61            55.87
                           512 x 512               58.86    56.11                 55.91            56.53
Normal (μ, σ),             64 x 64                 61.18    61.42                 58.46            57.53
μ = min(R,C)/2,            128 x 128               59.16    56.89                 58.02            54.79
for [1x1 - RxC]            256 x 256               60.10    58.02                 55.12            55.09
                           512 x 512               61.28    57.04                 57.22            55.89
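
The figures in Table 2 can be summarised with a short script. The values below are copied from the table; the per-strategy mean over the four system sizes, and the fragmentation computed as 100 minus the utilization (relying on the statement that there is no internal fragmentation), are derived quantities calculated here and are not reported in the paper.

```python
utilization = {
    "uniform": {
        "k-Tree": [60.97, 60.11, 59.77, 58.86],
        "FSL": [56.06, 56.98, 55.77, 56.11],
        "BL": [57.27, 56.67, 54.61, 55.91],
        "QA": [55.64, 57.84, 55.87, 56.53],
    },
    "normal": {
        "k-Tree": [61.18, 59.16, 60.10, 61.28],
        "FSL": [61.42, 56.89, 58.02, 57.04],
        "BL": [58.46, 58.02, 55.12, 57.22],
        "QA": [57.53, 54.79, 55.09, 55.89],
    },
}


def summarise(rows):
    """Mean utilization and implied fragmentation (100 - U) per strategy."""
    out = {}
    for name, values in rows.items():
        mean_u = sum(values) / len(values)
        out[name] = (mean_u, 100.0 - mean_u)
    return out


for dist, rows in utilization.items():
    for name, (u, f) in summarise(rows).items():
        print(f"{dist:8s} {name:7s} U = {u:5.2f}%  F = {f:5.2f}%")
```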




Fig. 9. Effect of “the task sizes” to the system fragmentation (%). 





Fig. 11. Effect of “the system sizes” to the average allocation and deallocation time. 



4 Conclusion 

In this paper, we present the design and the development of the “generalized k-Tree-based sub-mesh allocation” model for the partitionable multi-dimensional mesh-connected systems. The time complexity of the k-Tree-based approach is $O(N_A + N_F)$ for the partitionable 2-D and 3-D meshes and $O(k^4 2^k (N_A + N_F) + k^2 2^{2k})$ for the partitionable k-D meshes, where $N_A$ is the maximum number of allocated tasks and $N_F$ is the corresponding number of free sub-meshes. In simulation studies, a number of experiments were performed to investigate the system performance of applying the k-Tree-based model to the partitionable 2-D meshes. In the experimental results, the system performance (i.e., the system utilization and system fragmentation) of the k-Tree-based model for the partitionable 2-D meshes was comparable to existing 2-D mesh-based strategies. In addition, the k-Tree-based approach yielded an improved allocation/deallocation decision time (i.e., the average allocation time and the average allocation and deallocation time), compared to those 2-D mesh-based strategies, when the system sizes (N) are very large.

Abstract. Several analytical models of fully adaptive routing in wormhole-routed k-ary n-cubes under the uniform traffic pattern have recently been proposed in the literature. This paper describes the first analytical model of fully adaptive routing in k-ary n-cubes in the presence of non-uniform traffic generated by the digit-reversal permutation, which is an important communication operation found in many matrix computation problems. Results obtained through simulation experiments confirm that the model predicts message latency with a good degree of accuracy under different working conditions.



1 Introduction 

Most current multicomputers, e.g. the Cray T3E [2] and the Cray T3D [14], employ k-ary n-cube interconnection networks for low-latency and high-bandwidth inter-processor communication. The k-ary n-cube has an n-dimensional grid structure with k nodes in each dimension such that every node is connected to its neighbouring nodes in each dimension by direct channels. The two most popular instances of k-ary n-cubes are the hypercube (where k = 2) and the 2- and 3-dimensional torus (where n = 2 and 3) [10].

Current routers significantly reduce message latency by using wormhole switching (also widely known as wormhole routing [17]). Wormhole routing divides a message into elementary units called flits, each of a few bytes, for transmission and flow control, and advances each flit as soon as it arrives at a node. The header flit (containing routing information) governs the route and the remaining data flits follow it in a pipelined fashion. Moreover, throughput in wormhole-routed networks can be increased by organizing the flit buffers associated with a given physical channel into several virtual channels [7].

Most interconnection networks, including k-ary n-cubes, provide multiple paths 
for routing messages between a given pair of nodes. Deterministic routing, where 
messages with the same source and destination addresses always take the same 
network path, has been popular because it requires a simple deadlock-avoidance 
algorithm, resulting in a simple router implementation [10]. However, messages 
cannot use alternative paths to avoid blocked channels. Fully-adaptive routing 
overcomes this limitation by enabling messages to explore all alternative paths. 
Several authors like Duato [9], Lin et al [16], and Su and Shin [24] have proposed 
fully-adaptive routing algorithms, which can achieve deadlock-freedom with only one 
extra virtual channel compared to deterministic routing. 

Analytical models of deterministic routing in common wormhole-routed networks, including the k-ary n-cube, have been widely reported in the literature [1, 5, 6, 8]. Several researchers have recently proposed analytical models of fully-adaptive routing under the uniform traffic pattern [3, 20, 18].

A number of studies [10] have revealed that the performance advantages of adaptive routing over deterministic routing are more noticeable when the traffic is non-uniform. Many real-world parallel applications in science and engineering exhibit these kinds of traffic patterns [13, 23]. For instance, computing multi-dimensional FFTs, finite elements, matrix problems and divide-and-conquer strategies exhibit regular communication patterns [12], which are highly non-uniform as they put uneven bandwidth requirements on network channels. Permutation patterns, such as digit-reversal, matrix-transpose, shuffle, exchange, butterfly and vector-reversal, are examples of regular patterns that generate typical non-uniform traffic in the network (see [12, 23] for more details on these permutations). To the best of our knowledge, there has been hardly any attempt to propose an analytical model of adaptive routing in wormhole-routed networks under non-uniform traffic patterns, including permutation traffic patterns. In fact, most existing studies have resorted to simulation to evaluate the performance of adaptive routing under such traffic conditions [8, 16, 19, 24]. In an effort to fill this gap, this paper proposes the first analytical model for computing message latency in the k-ary n-cube with fully adaptive routing in the presence of digit-reversal permutation traffic, which is one of the most important permutations found in typical parallel applications, such as matrix problems and radix-k FFT computation [12, 13]. The model is developed for Duato's fully adaptive routing algorithm [9], but the modelling approach can equally be used for the other routing algorithms described in [16, 24].



2 The k-Ary n-Cube and Duato's Routing Algorithm

The unidirectional k-ary n-cube, where k is referred to as the radix and n as the dimension, has $N = k^n$ identical nodes, arranged in n dimensions, with k nodes per dimension. Each node can be identified by an n-digit radix-k address $(a_1, a_2, \ldots, a_n)$. The i-th digit of the address vector, $a_i$, represents the node position in the i-th dimension. There is a link from node $(a_1, a_2, \ldots, a_n)$ to node $(b_1, b_2, \ldots, b_n)$ if and only if there exists an i ($1 \le i \le n$) such that $a_i = (b_i + 1) \bmod k$ and $a_j = b_j$ for ($1 \le j \le n$; $i \ne j$). Each node consists of a processing element (PE) and a router, as illustrated in Fig. 1. The PE contains a processor and some local memory. The router has (n + 1) input and (n + 1) output channels. A node is connected to its neighbouring nodes through n input and n output channels in a unidirectional k-ary n-cube. The injection channel is used by the PE to inject messages into the network (via the router), and messages at the destination leave the network via the ejection channel. Each physical channel is associated with some, say V, virtual channels. Each virtual channel has its own flit queue, but shares the bandwidth of the physical channel with other virtual channels in a time-multiplexed fashion [7]. The router contains flit buffers for each incoming virtual channel. An (n + 1)V-way crossbar switch directs message flits from any input virtual channel to any output virtual channel. Such a switch can simultaneously connect multiple input to multiple output virtual channels, provided there are no conflicts [7, 10].

Fig. 1. The node structure in the k-ary n-cube.
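
The link rule of the unidirectional k-ary n-cube can be made concrete with a short Python sketch. It reads the condition as $a_i = (b_i + 1) \bmod k$ in exactly one dimension i, with all other digits equal, so a link leaving node a leads to the node whose i-th digit is $(a_i - 1) \bmod k$. The function name `out_neighbours` and the tuple representation of addresses are assumptions made for illustration.

```python
from itertools import product


def out_neighbours(a, k):
    """Nodes b reached by one link from a: a_i = (b_i + 1) mod k in one dimension."""
    return [a[:i] + ((a[i] - 1) % k,) + a[i + 1:] for i in range(len(a))]


k, n = 4, 2
nodes = list(product(range(k), repeat=n))
in_degree = {node: 0 for node in nodes}
for a in nodes:
    for b in out_neighbours(a, k):
        in_degree[b] += 1

print(len(nodes), "nodes;", out_neighbours((0, 2), k))
# Every node has n outgoing and n incoming network channels.
assert all(len(out_neighbours(a, k)) == n for a in nodes)
assert all(d == n for d in in_degree.values())
```

Together with the injection and ejection channels this gives the (n + 1) input and (n + 1) output channels per router mentioned above.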

The high cost of adaptivity [4] has motivated researchers to develop adaptive routing algorithms that require fewer virtual channels. Several authors have proposed routing algorithms that exhibit a trade-off between adaptivity and the number of virtual channels required to ensure deadlock-freedom [10]. For instance, planar adaptive routing [10] and the turn model [11] are partially adaptive.

Authors in [9, 16, 24] have proposed fully-adaptive routing algorithms, which can achieve deadlock freedom using a minimal number of virtual channels. Their proposed algorithms require only one extra virtual channel per physical channel, compared to deterministic routing, allowing for an efficient router implementation. For instance, Duato’s algorithm divides the virtual channels into two classes: a and b. At each routing step, a message visits adaptively any available virtual channel from class a. If all the virtual channels belonging to class a are busy, it visits a virtual channel from class b using deterministic routing. Duato’s algorithm requires at least three virtual channels per physical channel to ensure deadlock-freedom, where class a contains one virtual channel and class b owns two virtual channels. When there are more than three virtual channels, network performance is maximised when the extra virtual channels are added to class a [9]. Thus, when V virtual channels are used per physical channel in a k-ary n-cube, the best performance is achieved when class a and b contain (V-2) and 2 virtual channels respectively.
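
The channel-selection rule just described can be sketched as follows. This is a schematic illustration only and not a router implementation: the function names, the representation of free channels as a list of identifiers, and the random choice among free class-a channels are assumptions made here; the deterministic route itself (which class-b channel is the right one) is taken as given.

```python
import random


def split_classes(v: int):
    """Split V virtual channels into (class a, class b) sizes: (V - 2, 2)."""
    if v < 3:
        raise ValueError("at least three virtual channels per physical channel are needed")
    return v - 2, 2


def select_virtual_channel(adaptive_free, deterministic_free, rng=random):
    """Choose the next virtual channel for a message header.

    adaptive_free: identifiers of free class-a channels on useful physical channels.
    deterministic_free: identifier of the free class-b channel on the
        deterministic route, or None if it is busy.
    """
    if adaptive_free:
        # Any free adaptive channel may be used; pick one at random.
        return ("a", rng.choice(adaptive_free))
    if deterministic_free is not None:
        # All adaptive channels are busy: fall back to deterministic routing.
        return ("b", deterministic_free)
    return None  # the message is blocked at this router


print(split_classes(6))
print(select_virtual_channel([], 1))
```

Returning `None` models the blocked case, in which the header has to wait for a channel to be released.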



3 The Analytical Model 

The model uses assumptions which are commonly used in the literature [1, 5, 7, 8, 18, 20, 21].

a) There are two types of traffic in the network: "uniform" and "digit-reversal". In the uniform traffic pattern, a message is destined to any other node in the network with equal probability. In the traffic pattern generated according to the digit-reversal permutation [12], a message generated in the source node $x = x_1 x_2 \cdots x_n$ is destined to the node $d(x) = x_n x_{n-1} \cdots x_1$. Let us refer to these two types of messages as uniform and digit-reversal messages respectively. When a message is generated it has a finite probability $\alpha$ of being a digit-reversal message and probability $(1 - \alpha)$ of being uniform. When $\alpha = 0$ the traffic pattern is purely uniform, while $\alpha = 1$ defines pure digit-reversal traffic. A similar traffic model has already been used by Pfister and Norton [19] to generate non-uniform traffic patterns containing hot spots.

b) Nodes generate traffic independently of each other, following a Poisson process with a mean rate of $\lambda_g$ messages/cycle. Therefore, the mean generation rates of the uniform and digit-reversal traffic are $(1 - \alpha)\lambda_g$ and $\alpha\lambda_g$ respectively.

c) Message length is B flits, each of which is transmitted in one cycle between two adjacent routers.

d) The local queue in the source node has infinite capacity. Moreover, messages are transferred to the local PE through the ejection channel as soon as they arrive at their destinations.

e) V virtual channels are used per physical channel. With Duato’s fully adaptive routing algorithm [9], class a contains (V-2) virtual channels which are crossed adaptively and class b contains two virtual channels which are crossed deterministically (e.g. in increasing order of dimensions). Let the virtual channels belonging to class a and b be called the adaptive and deterministic virtual channels respectively. When there is more than one adaptive virtual channel available, a message chooses one at random. To simplify the model derivation, no distinction is made between the deterministic and adaptive virtual channels when computing the different virtual channel occupancy probabilities [3, 20, 18].
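
The traffic mix of assumption (a) can be illustrated with a small destination generator. In this sketch the digit-reversal probability is written `alpha`, the function names are hypothetical, and rejection sampling is one simple way (chosen here, not taken from the paper) to draw a uniform destination different from the source.

```python
import random


def digit_reversal(x):
    """Destination of source address x under the digit-reversal permutation."""
    return tuple(reversed(x))


def pick_destination(src, k, alpha, rng):
    """Draw a destination: digit-reversal with probability alpha, else uniform."""
    if rng.random() < alpha:
        return digit_reversal(src)
    # Uniform traffic: any node other than the source, with equal probability.
    while True:
        dst = tuple(rng.randrange(k) for _ in src)
        if dst != src:
            return dst


rng = random.Random(0)
src = (0, 1, 2, 3)
print(pick_destination(src, k=4, alpha=1.0, rng=rng))
print(pick_destination(src, k=4, alpha=0.0, rng=rng))
```

Note that an address that reads the same in both directions is mapped to itself by `digit_reversal`; how such messages are treated is not covered by this sketch.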

The mean message latency is composed of the mean network latency, $\bar{S}$, that is the time to cross the network, and the mean waiting time seen by a message in the source node, $\bar{W}_s$. However, to capture the effects of virtual channel multiplexing, the mean message latency is scaled by a factor, $\bar{V}$, representing the average degree of virtual channel multiplexing that takes place at a physical channel. Therefore, the mean message latency can be approximated by

Latency = $(\bar{S} + \bar{W}_s)\,\bar{V}$ (1)

Examining the address patterns generated by digit-reversal permutations reveals that we need to consider the two cases where n is even and odd separately when computing the different quantities $\bar{S}$, $\bar{W}_s$, and $\bar{V}$. This is because when n is even all network channels receive both uniform and digit-reversal traffic. However, when n is odd not all channels receive both types of messages: while the channels at the centre dimension (i.e., (n+1)/2) receive uniform messages only, channels at the other dimensions receive uniform as well as digit-reversal messages.



3.1 Outline of the Model When n Is Even 

Given that a uniform message can make between 1 and n(k-1) hops (i.e., the network
diameter), the average number of hops that a uniform message makes across the
network, $\bar{d}_u$, is given by

$$\bar{d}_u = \sum_{i=1}^{n(k-1)} i\,P_{u_i} \qquad (2)$$

where $P_{u_i}$ is the probability that a uniform message makes i hops to reach its
destination. The average number of hops that a uniform message makes in each
dimension can therefore be given by [1]

$$\bar{k}_u = \bar{d}_u / n = (k-1)/2 \qquad (3)$$

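In [18, 21], the number of ways that i hops can be distributed among the n dimensions of
a k-ary n-cube with at most h hops in each dimension is derived to be

$$D^{h}(i,n) = \sum_{l=0}^{n} (-1)^{l} \binom{n}{l} \binom{i - l(h+1) + n - 1}{n-1} \qquad (4)$$

where a binomial coefficient with a negative upper argument is taken to be zero.
Thus, the probability that a uniform message generated in a given source node makes
i hops to reach its destination, $P_{u_i}$, can be written as

$$P_{u_i} = \frac{D^{k-1}(i,n)}{N-1} = \frac{1}{N-1}\sum_{l=0}^{n} (-1)^{l} \binom{n}{l} \binom{i - lk + n - 1}{n-1} \qquad (5)$$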
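The hop-count distribution of equation (5) can be checked numerically. The following is a minimal sketch, assuming a unidirectional k-ary n-cube (so each dimension contributes between 0 and k-1 hops) and the standard inclusion-exclusion count of bounded compositions; the function names `ways`, `uniform_hop_distribution` and `brute_force_counts` are illustrative and do not come from the paper.

```python
from itertools import product
from math import comb


def ways(i, n, h):
    """Count the ways to split i hops over n dimensions, at most h hops each."""
    total = 0
    for l in range(n + 1):
        top = i - l * (h + 1) + n - 1
        if top < 0:
            break  # all further inclusion-exclusion terms are zero
        total += (-1) ** l * comb(n, l) * comb(top, n - 1)
    return total


def uniform_hop_distribution(k, n):
    """Return {i: P_u_i} for i = 1 .. n(k-1), dividing the counts by N - 1."""
    N = k ** n
    return {i: ways(i, n, k - 1) / (N - 1) for i in range(1, n * (k - 1) + 1)}


def brute_force_counts(k, n):
    """Enumerate every per-dimension offset vector and tally total hop counts."""
    counts = {}
    for offsets in product(range(k), repeat=n):
        hops = sum(offsets)
        counts[hops] = counts.get(hops, 0) + 1
    return counts


if __name__ == "__main__":
    k, n = 4, 3
    dist = uniform_hop_distribution(k, n)
    mean_hops = sum(i * p for i, p in dist.items())
    print(f"mean hops = {mean_hops:.4f}, n(k-1)/2 = {n * (k - 1) / 2}")
```

Because the source itself is excluded from the N-1 possible destinations, the mean obtained this way is slightly larger than n(k-1)/2; the difference shrinks as N grows.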

Let us now calculate the average number of hops that a digit-reversal message
makes across the network, $\bar{d}_d$. The number of possible combinations, $n_{d_i}$, where the
address patterns $x_1 x_2 \cdots x_n$ and $x'_1 x'_2 \cdots x'_n$ differ in exactly i digits (i = 0, 1, ..., n), is
given by [22]

$$n_{d_i} = \begin{cases} \binom{n/2}{i/2}\, k^{n/2}\,(k-1)^{i/2} & \text{if } i \text{ is even} \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$



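Hence, the probability that the source and destination addresses of a digit-reversal
message differ in i digits, $P_{d_i}$, can be written as

$$P_{d_i} = \begin{cases} \dfrac{\binom{n/2}{i/2}\, k^{n/2}\,(k-1)^{i/2}}{N - n_{d_0}} & \text{if } i \text{ is even} \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$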
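The counting argument behind the digit-reversal probabilities can be verified by enumeration. The sketch below assumes that the digit-reversal permutation maps address $x_1 x_2 \cdots x_n$ to $x_n \cdots x_2 x_1$ and that n is even, in which case digits pair up and the number of addresses differing from their reversal in exactly i digits is $\binom{n/2}{i/2} k^{n/2} (k-1)^{i/2}$ for even i and zero otherwise. The function names are illustrative only.

```python
from itertools import product
from math import comb


def digit_reversal_counts(k, n):
    """Count addresses whose digit-reversed image differs in exactly i digits."""
    counts = [0] * (n + 1)
    for addr in product(range(k), repeat=n):
        reversed_addr = addr[::-1]
        differing = sum(a != b for a, b in zip(addr, reversed_addr))
        counts[differing] += 1
    return counts


def closed_form_counts(k, n):
    """Closed-form counts for even n (assumed form, see the lead-in)."""
    if n % 2 != 0:
        raise ValueError("this closed form covers even n only")
    half = n // 2
    return [
        comb(half, i // 2) * k ** half * (k - 1) ** (i // 2) if i % 2 == 0 else 0
        for i in range(n + 1)
    ]


def digit_reversal_probabilities(k, n):
    """P_d_i for i = 0 .. n, normalised over nodes that actually move data."""
    counts = closed_form_counts(k, n)
    moving = k ** n - counts[0]  # nodes whose address is not its own reversal
    return [0.0] + [c / moving for c in counts[1:]]


if __name__ == "__main__":
    print(digit_reversal_counts(4, 4))
    print(closed_form_counts(4, 4))
```

The entry for i = 0 counts the nodes that are their own digit reversal; under the traffic model these nodes generate uniform messages only.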



Let us assume that the i-th digit (i = 0, 1, ..., n) in the source address, $x_i$, is different
from that of the destination address, i.e. $x'_i$. Considering all possible values that $x_i$ and
$x'_i$ may take (i.e. $0 \le x_i, x'_i < k$), the average difference between $x_i$ and $x'_i$, which is the
average number of hops that a digit-reversal message makes in the i-th dimension, is
given by [1]

$$\bar{k}_d = (k-1)/2 \qquad (8)$$

Thus, the average number of hops that a digit-reversal message makes across the
network is given by

$$\bar{d}_d = \sum_{i=1}^{n} i\,\bar{k}_d\,P_{d_i} \qquad (9)$$



Examining the traffic generated by the digit-reversal permutation shows that a
fraction $n_{d_0}/N$ of the network nodes send only uniform messages and the
remaining fraction $(1 - n_{d_0}/N)$ send a combination of uniform (with probability
$1-\alpha$) and digit-reversal (with probability $\alpha$) messages. Using the above equations,
the average number of hops, $\bar{d}$, that a message makes in the network is derived to be

$$\bar{d} = (n_{d_0}/N)\,\bar{d}_u + (1 - n_{d_0}/N)\left(\alpha\,\bar{d}_d + (1-\alpha)\,\bar{d}_u\right) = \xi_d\,\bar{d}_d + \xi_u\,\bar{d}_u \qquad (10)$$

where the digit-reversal and uniform messages contribute with the following weights

$$\xi_d = \alpha\,(1 - n_{d_0}/N) \qquad (11)$$

$$\xi_u = n_{d_0}/N + (1-\alpha)(1 - n_{d_0}/N) \qquad (12)$$
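As a quick numerical illustration of the two traffic weights, the sketch below computes them for an even-dimensional network and checks that they sum to one. It assumes that the number of nodes sending only uniform traffic is $k^{n/2}$ (the addresses equal to their own digit reversal when n is even); the function names and the sample hop counts in the usage lines are hypothetical.

```python
def traffic_weights(k, n, alpha):
    """Return (xi_d, xi_u), the digit-reversal and uniform traffic weights."""
    if n % 2 != 0:
        raise ValueError("this sketch covers even n only")
    N = k ** n
    fixed_fraction = k ** (n // 2) / N  # nodes that are their own digit reversal
    xi_d = alpha * (1 - fixed_fraction)
    xi_u = fixed_fraction + (1 - alpha) * (1 - fixed_fraction)
    return xi_d, xi_u


def mean_hops(k, n, alpha, d_u, d_d):
    """Weighted average hop count over both message types."""
    xi_d, xi_u = traffic_weights(k, n, alpha)
    return xi_d * d_d + xi_u * d_u


if __name__ == "__main__":
    # d_u and d_d below are placeholder values chosen only for illustration.
    print(traffic_weights(8, 2, 0.7))
    print(mean_hops(8, 2, 0.7, d_u=7.0, d_d=7.0))
```

When both message types make the same average number of hops, the weighted mean reduces to that common value, which is a convenient sanity check on the weights.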

Adaptive routing enables a message to explore any available channel that brings it
closer to its destination, resulting in an approximately even traffic rate on network
channels. Since a message makes, on average, $\bar{d}$ hops in the network, the total traffic
existing in the network at a given time is $N\bar{d}\lambda_g$. Given that a router in the k-ary n-cube
has n output channels, the rate of messages arriving at each channel, $\lambda_c$, can be written
as [1,6]

$$\lambda_c = \bar{d}\,\lambda_g / n \qquad (13)$$

The uniform and digit-reversal messages see different network latencies as they
cross different numbers of channels to reach their destinations. If $\bar{S}_u$ and $\bar{S}_d$ denote
the mean network latency for uniform and digit-reversal messages, respectively, the
mean network latency taking into account both types of messages is given by

$$\bar{S} = \xi_d\,\bar{S}_d + \xi_u\,\bar{S}_u \qquad (14)$$

Averaging over all possible cases for a digit-reversal message gives the mean
network latency for digit-reversal messages, $\bar{S}_d$, as

$$\bar{S}_d = \sum_{i=1}^{n} P_{d_i}\,S_{d_i} \qquad (15)$$

where $S_{d_i}$ is the network latency for a digit-reversal message whose source and
destination addresses differ in i digits. As a uniform message makes, on average, $\bar{d}_u$
hops across the network, the mean network latency for a uniform message, $\bar{S}_u$, is
given by

$$\bar{S}_u = B + \bar{d}_u + \sum_{j=1}^{\bar{d}_u} P_{block_{u_j}}\,\bar{w}_c \qquad (16)$$

The term $B + \bar{d}_u$ in the above equation accounts for the message transmission time
while the term $P_{block_{u_j}}\,\bar{w}_c$ accounts for the delay due to blocking at the j-th hop channel
($1 \le j \le \bar{d}_u$) along the message path. $P_{block_{u_j}}$ is the probability of blocking when the
uniform message arrives at the j-th hop channel and $\bar{w}_c$ is the mean waiting time to
acquire a virtual channel when the message is blocked. Similarly, the network latency
for a digit-reversal message whose source and destination address patterns are
different in i digits, $S_{d_i}$, is given by

$$S_{d_i} = B + i\bar{k}_d + \sum_{j=1}^{i\bar{k}_d} P_{block_{d_{i,j}}}\,\bar{w}_c \qquad (17)$$

where $P_{block_{d_{i,j}}}$ is the probability of blocking when the digit-reversal message arrives
at the j-th hop channel. A message (either a uniform or digit-reversal message) is
blocked at the j-th hop channel when all the adaptive virtual channels of the remaining
dimensions to be visited and also the deterministic virtual channel of the lowest
dimension to be visited are busy. Given that blocking has occurred, a message has to
wait for the deterministic virtual channel at the lowest dimension. To compute the
probability of blocking for a uniform message, $P_{block_{u_j}}$, let us consider a uniform
message making $\bar{d}_u$ hops across the network ($\bar{k}_u$ hops in each of n dimensions) that
has arrived at the j-th hop channel along its path. Such a message may already have
passed up to $(j-1)/\bar{k}_u$ dimensions. If l ($0 \le l \le (j-1)/\bar{k}_u$) dimensions are passed,
then the message still has to cross the remaining (n-l) dimensions. Therefore, the
probability of blocking can be expressed as

$$P_{block_{u_{j,l}}} = P_{pass_{u_{j,l}}}\,P_a^{\,n-l-1}\,P_{a\&d} \qquad (18)$$

where $P_{pass_{u_{j,l}}}$ is the probability that l dimensions are passed at the j-th hop channel,
$P_a$ is the probability that all adaptive virtual channels at a physical channel are busy,
and $P_{a\&d}$ is the probability that all adaptive and deterministic virtual channels at a
physical channel are busy. Since l can be 0, 1, ..., or $(j-1)/\bar{k}_u$, the probability of
blocking at the j-th hop channel is given by

$$P_{block_{u_j}} = \sum_{l=0}^{(j-1)/\bar{k}_u} P_{block_{u_{j,l}}} \qquad (19)$$

The probability that l dimensions are passed at the j-th hop channel, $P_{pass_{u_{j,l}}}$, can
be computed as follows. The number of combinations in which l particular dimensions are
passed is . These I dimensions can be chosen from n dimensions in 

(") ways resulting in a total of (j - lk^,n - 1) combinations that / dimensions 

may be passed. Dividing this by the total number of combinations in which j hops can be
made over n dimensions gives the probability that a uniform message has passed l
dimensions at its j-th hop as

( 20 ) 



passui^ 






Let us now consider a digit-reversal message that passes i dimensions to reach its
destination, i.e. a digit-reversal message whose source and destination addresses differ
in i digits. Such a message makes $i\bar{k}_d$ hops over i dimensions. Adopting the same
approach taken above for calculating $P_{block_{u_j}}$ for uniform messages, we can derive



Pblockj. j 



block((- 



(j-l)lkd 

= 






* blockj 



= p j 

i.i,j 



(0</<(y-l)/fcJ 



passd^i 






( 21 ) 

( 22 ) 

(23) 



The calculation of the probabilities $P_a$ and $P_{a\&d}$ has already been outlined in
[18]. If $P_v$ ($0 \le v \le V$) denotes the probability that v virtual channels are busy at a
physical channel ($P_v$ is calculated in the next sub-section), $P_a$ and $P_{a\&d}$ are given in
terms of $P_v$ as (see [18] for a more detailed derivation of the probabilities $P_a$ and
$P_{a\&d}$)

$$P_a = P_V + \frac{2P_{V-1}}{\binom{V}{V-1}} + \frac{P_{V-2}}{\binom{V}{V-2}} \qquad (24)$$

$$P_{a\&d} = P_V + \frac{2P_{V-1}}{\binom{V}{V-1}} \qquad (25)$$

To determine the mean waiting time, $\bar{w}_c$, to acquire a virtual channel when a
message is blocked, a physical channel is treated as an M/G/1 queue. The mean
arrival rate to the channel is $\lambda_c$ (given by equation 13) and the mean service time seen by a
message at a given channel can be approximated by $\bar{S}$ (given by equation 14). Using
results from queueing theory, the mean waiting time in the event of blocking can be
expressed as [15]

$$\bar{w}_c = \frac{\rho\,\bar{S}\,(1 + C_S^2)}{2(1-\rho)} \qquad (26)$$

$$\rho = \lambda_c\,\bar{S} \qquad (27)$$

$$C_S^2 = \sigma_S^2 / \bar{S}^2 \qquad (28)$$

where $\sigma_S^2$ is the variance of the service time distribution. Since the minimum service
time is equal to the message length B, following a suggestion proposed in [8], the
variance of the service time distribution can be approximated as

$$\sigma_S^2 = (\bar{S} - B)^2 \qquad (29)$$

As a result, the mean waiting time, $\bar{w}_c$, to acquire a virtual channel when a
message is blocked becomes

$$\bar{w}_c = \frac{\lambda_c\,\bar{S}^2\left[1 + (\bar{S}-B)^2/\bar{S}^2\right]}{2\,(1 - \lambda_c\bar{S})} \qquad (30)$$
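The waiting-time expression is the Pollaczek-Khinchine mean waiting time for an M/G/1 queue with the service-time variance approximated by the squared difference between the mean service time and the message length. A minimal sketch follows; the function name `blocking_wait` and its argument names are illustrative, and the numeric values in the usage lines are arbitrary.

```python
def blocking_wait(arrival_rate, mean_service, message_length):
    """Mean M/G/1 waiting time with variance approximated as (S - B)^2."""
    rho = arrival_rate * mean_service  # channel utilisation
    if not 0 <= rho < 1:
        raise ValueError("utilisation must lie in [0, 1) for a stable queue")
    # Squared coefficient of variation of the service time.
    cs2 = (mean_service - message_length) ** 2 / mean_service ** 2
    return rho * mean_service * (1 + cs2) / (2 * (1 - rho))


if __name__ == "__main__":
    for rate in (0.001, 0.005, 0.01, 0.02):
        print(rate, round(blocking_wait(rate, mean_service=40.0, message_length=32), 3))
```

When the mean service time equals the message length the variance term vanishes and the expression reduces to the M/D/1 waiting time, which gives a simple consistency check.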



The probability, $P_v$, that v virtual channels are busy at a physical
channel can be determined using a Markov chain with V+1 states. State
$\pi_v$ ($0 \le v \le V$) corresponds to v virtual channels being busy. The transition rate out
of state $\pi_v$ to state $\pi_{v+1}$ is the traffic rate $\lambda_c$ (given by equation 13) while the rate out
of state $\pi_v$ to state $\pi_{v-1}$ is approximated by $1/\bar{S}$ ($\bar{S}$ is given by equation 14). The
transition rate out of state $\pi_V$ is reduced by $\lambda_c$ to account for the arrival of messages
while a channel is in this state. The probability $P_v$ can be computed using the steady-state
equations as

$$P_v = \begin{cases} (1 - \lambda_c\bar{S})\,(\lambda_c\bar{S})^v & 0 \le v < V \\ (\lambda_c\bar{S})^V & v = V \end{cases} \qquad (31)$$

In virtual channel flow control, multiple virtual channels share the bandwidth of a
physical channel in a time-multiplexed manner. The average degree of multiplexing
of virtual channels in the network is given by [7]

$$\bar{V} = \frac{\sum_{i=0}^{V} i^2 P_i}{\sum_{i=0}^{V} i\,P_i} \qquad (32)$$
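The occupancy probabilities and the multiplexing degree are straightforward to evaluate once the channel utilisation is known. The sketch below assumes the geometric steady-state solution commonly used for this birth-death chain, namely $(1-x)x^v$ for v below V and $x^V$ for v = V, where x is the arrival rate multiplied by the mean service time. The function names are illustrative and this form is an assumption of the sketch.

```python
def vc_occupancy(arrival_rate, mean_service, num_vcs):
    """Return [P_0, ..., P_V], the probability that v virtual channels are busy."""
    x = arrival_rate * mean_service
    if not 0 <= x < 1:
        raise ValueError("utilisation must lie in [0, 1)")
    probs = [(1 - x) * x ** v for v in range(num_vcs)]
    probs.append(x ** num_vcs)  # the last state absorbs the geometric tail
    return probs


def multiplexing_degree(probs):
    """Average multiplexing degree: sum(v^2 P_v) / sum(v P_v)."""
    weighted = sum(v * p for v, p in enumerate(probs))
    if weighted == 0:
        return 1.0  # idle channel: a message never shares the bandwidth
    return sum(v * v * p for v, p in enumerate(probs)) / weighted


if __name__ == "__main__":
    probs = vc_occupancy(arrival_rate=0.01, mean_service=40.0, num_vcs=3)
    print(probs, multiplexing_degree(probs))
```

The multiplexing degree stays between 1 and V and approaches 1 at low load, so the scaling of the latency by this factor has little effect until the channels become busy.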

The mean waiting time, $\bar{W}_s$, at the local queue in the source
node is calculated in the same manner as the mean waiting time at a
given network channel (equation 30). The local queue is treated as an M/G/1 queue
with an arrival rate of $\lambda_g/V$ (recalling that a message in the source node can enter the
network through any of the V virtual channels), a service time of $\bar{S}$, and thus a mean
waiting time of [15]

$$\bar{W}_s = \frac{(\lambda_g/V)\,\bar{S}^2\left[1 + (\bar{S}-B)^2/\bar{S}^2\right]}{2\left(1 - (\lambda_g/V)\,\bar{S}\right)} \qquad (33)$$



Examining the above equations reveals that there are several inter-dependencies
between the different variables of the model. For instance, equations (14) and (15)
reveal that $\bar{S}$ is a function of $\bar{S}_u$ and $S_{d_i}$, while equations (16), (17) and (30) show
that $\bar{S}_u$ and $S_{d_i}$ are functions of $\bar{S}$. Given that closed-form solutions to such
inter-dependencies are very difficult to determine, the different variables of the model are
computed using iterative techniques.
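The iterative solution can be illustrated with a damped fixed-point iteration on the mean network latency. The sketch below is a deliberately simplified stand-in for the full model and not the authors' procedure: it assumes uniform traffic only, takes the mean hop count as n(k-1)/2, and replaces the detailed blocking probabilities with the crude assumption that a message is blocked only when all V virtual channels are busy (probability $x^V$, with x the channel utilisation). The function name `solve_latency` and the damping factor are choices made for this sketch.

```python
def solve_latency(k, n, V, B, lam_g, tol=1e-9, max_iter=10_000, damping=0.5):
    """Toy fixed-point solver for the mean message latency (uniform traffic)."""
    d = n * (k - 1) / 2      # mean hop count
    lam_c = d * lam_g / n    # arrival rate seen by each channel
    S = B + d                # zero-load latency as the starting guess
    for _ in range(max_iter):
        rho = lam_c * S
        if rho >= 1:
            raise ValueError("channel utilisation reached 1 (saturation)")
        # M/G/1 waiting time with variance approximated as (S - B)^2.
        w = lam_c * S ** 2 * (1 + (S - B) ** 2 / S ** 2) / (2 * (1 - rho))
        p_block = rho ** V   # ASSUMPTION: blocked only if all V channels are busy
        S_new = B + d + d * p_block * w
        converged = abs(S_new - S) < tol
        S = damping * S_new + (1 - damping) * S
        if converged:
            break
    else:
        raise RuntimeError("fixed-point iteration did not converge")

    # Multiplexing degree from the geometric occupancy distribution.
    rho = lam_c * S
    probs = [(1 - rho) * rho ** v for v in range(V)] + [rho ** V]
    busy = sum(v * p for v, p in enumerate(probs))
    v_bar = sum(v * v * p for v, p in enumerate(probs)) / busy if busy > 0 else 1.0

    # Source queue: M/G/1 with arrival rate lam_g / V and service time S.
    rho_s = (lam_g / V) * S
    w_s = (lam_g / V) * S ** 2 * (1 + (S - B) ** 2 / S ** 2) / (2 * (1 - rho_s))
    return (S + w_s) * v_bar


if __name__ == "__main__":
    for rate in (0.0, 0.001, 0.002, 0.003):
        print(rate, round(solve_latency(8, 2, 3, 32, rate), 3))
```

Damping is used because the undamped update can oscillate close to saturation; the iteration stops with an error once the utilisation reaches one, which mirrors the fact that the model is only meaningful in the steady-state region.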



3.2 Outline of the Model When n Is Odd 

As stated above, when n is odd, channels belonging to dimension (n+1)/2 receive
uniform messages only. The traffic due to digit-reversal messages falls only on the
channels belonging to dimensions 1, 2, ..., (n-1)/2, (n+3)/2, ..., n. Let us refer to
dimension (n+1)/2 as the "centre-dimension" and the channels belonging to this
dimension as the "centre-channels". Similarly, let us refer to the other dimensions as
"other-dimensions" and their associated channels as the "other-channels". When n is
odd, equations (6) and (18) will change (see [22] for more details) to

if,' iseven 
otherwise 



0 

= P 



(34) 



passui- 



n — 
/-I 






a &d other 



(n-l) 



where 



Pn = Pv + 

'^centre '^centre ‘^Ci 

<^^^centre '^centre ^ 



P = P 

^other ^other 



2R 



^ Mother 



•Mother ‘^^^center 

re l[v-l )+ Pv-2ee, 

/( ^ ) 

:entre / 

l{v-l )+ Pv-lether ^iv'-2 ) 



I (n-l-l) p„^,. 2 p 



Mother ^center ^^^olher 

i\] 



Mother ^ ^other \V~ 



^ Mother ‘ 

U) 



(35) 

(36) 

(37) 

(38) 

(39) 



While all channels receive uniform traffic, only the channels belonging to
other-dimensions receive digit-reversal traffic. Therefore, when n is odd the traffic rates
arriving at each centre-channel, $\lambda_{c_{centre}}$, and other-channel, $\lambda_{c_{other}}$, can be expressed as



Ken,er=^uPuKm 

A,.. =A, . /(M-1) 



^other 



^center 



(40) 

(41) 



The mean waiting times for a centre-channel and an other-channel, $\bar{w}_{c_{centre}}$ and
$\bar{w}_{c_{other}}$, can be expressed as
w 



^centre 






^other 



= [l + K,Uer - b} / - K,Uer ^ .e.Her )] 



(42) 

'’Mother ^ ^olher other ^ Mother ^ ^ (43) 

where $\bar{S}_u$ (given by equation 16) and $\bar{S}_{other}$ (calculated below) are approximated
values for the service time of a centre-channel and an other-channel, respectively.
Therefore, the mean waiting time for a channel considering both types of channels
would be

+[l-iKfcr (44) 

The mean service time for an other-channel, $\bar{S}_{other}$, can be approximated as






^other 



1 - 






(45) 



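Adapting the approach used to calculate $P_v$ when n is even, we can write the
expressions for the probability of having v busy virtual channels at a centre-channel,
$P_{v_{centre}}$, and at an other-channel, $P_{v_{other}}$, as

$$P_{v_{centre}} = \begin{cases} (1 - \lambda_{c_{centre}}\bar{S}_u)\,(\lambda_{c_{centre}}\bar{S}_u)^v & 0 \le v < V \\ (\lambda_{c_{centre}}\bar{S}_u)^V & v = V \end{cases} \qquad (46)$$

$$P_{v_{other}} = \begin{cases} (1 - \lambda_{c_{other}}\bar{S}_{other})\,(\lambda_{c_{other}}\bar{S}_{other})^v & 0 \le v < V \\ (\lambda_{c_{other}}\bar{S}_{other})^V & v = V \end{cases} \qquad (47)$$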



The average degree of multiplexing of virtual channels belonging to a centre-channel
and an other-channel in the network, and the total average multiplexing degree of
virtual channels in the network, are given by

$$\bar{V}_{centre} = \frac{\sum_{i=0}^{V} i^2 P_{i_{centre}}}{\sum_{i=0}^{V} i\,P_{i_{centre}}} \qquad (48)$$

$$\bar{V}_{other} = \frac{\sum_{i=0}^{V} i^2 P_{i_{other}}}{\sum_{i=0}^{V} i\,P_{i_{other}}} \qquad (49)$$

$$\bar{V} = \bar{V}_{centre}/n + (1 - 1/n)\,\bar{V}_{other} \qquad (50)$$




3.3 Simulation Experiments 

The above model has been validated through a discrete-event simulator that mimics
the behaviour of Duato's fully adaptive routing at the flit level in k-ary n-cubes. In
each simulation experiment, a total of 100K messages are delivered. Statistics
gathering was inhibited for the first 10K messages to avoid distortions due to the
initial startup conditions. The mean message latency is defined as the mean amount of
time from the generation of a message until the last data flit of the message reaches
the local PE at the destination node. Numerous experiments have been performed for
several combinations of network sizes, message lengths, digit-reversal traffic
fractions, and numbers of virtual channels to validate the model. However, for the sake
of specific illustration, Fig. 2 depicts latency results predicted by the proposed model
plotted against those provided by the simulator for an 8-ary 2-cube ($N = 8^2$) and an 8-ary
3-cube ($N = 8^3$), for message lengths B = 32 and 64 flits.
Moreover, the number of virtual channels per physical channel was set to V = 3 and 5
and the fraction of digit-reversal messages was assumed to be $\alpha$ = 0.1 and 0.7. The
horizontal axis in the figures shows the traffic generation rate at each node, $\lambda_g$, while
the vertical axis shows the mean message latency. The figures reveal that in all cases
the analytical model predicts the mean message latency with a good degree of
accuracy in the steady-state regions. Some discrepancies around the
saturation point are apparent. Nevertheless, the simplicity of the model makes it a
practical evaluation tool that can be used to gain insight into the performance
behaviour of fully adaptive routing in the k-ary n-cube interconnection network.

It is worth noting that latency results for different values of $\alpha$ reveal that
digit-reversal traffic has little impact on the mean message latency, since adaptive routing
is able to exploit alternative paths of the k-ary n-cube to route blocked messages and,
as a result, manages to distribute the traffic load approximately evenly among the
network channels.



4 Conclusion 

This paper has presented an analytical model to compute message latency in
wormhole-switched k-ary n-cubes with fully adaptive routing in the presence of traffic
generated by digit-reversal permutations, which are used in many parallel applications
(e.g., matrix problems and radix-k FFT computation). Although the model uses
Duato's fully adaptive routing algorithm, it can equally be adapted for other adaptive
routing algorithms proposed in [16, 24]. To the best of our knowledge, this is the first model
proposed in the literature that considers non-uniform traffic generated by
digit-reversal permutations in wormhole-routed k-ary n-cubes. Simulation experiments have
shown that the latency results predicted by the analytical model are in good
agreement with those obtained through simulation.

Fig. 2. Average latency versus message generation traffic in an 8-ary 2-cube and an 8-ary
3-cube for V = 3 and 5 virtual channels per physical channel, message length B = 32
and 64 flits, and digit-reversal traffic portions $\alpha$ = 0.1 and 0.7.

Abstract. Networks of workstations (NOWs) are becoming an increasingly
popular alternative to parallel computers for those applications
with high needs of resources such as memory capacity and input/output 
storage space, and also for small scale parallel computing. 

Although the mean time between failures (MTBF) for individual links 
and switches in a NOW is very high, the probability of a failure occur- 
rence dramatically increases as the network size becomes larger. More- 
over, there are external factors, such as accidental link disconnections, 
that also can affect the overall NOW reliability. Until the faulty element 
is replaced, the NOW is functioning in a degraded mode. Thus, it be- 
comes necessary to quantify how much the global NOW performance is 
reduced during the time the system remains in this state. 

In this paper we analyze the performance degradation of networks of 
workstations when failures in links or switches occur. Because the rout- 
ing algorithm is a key issue in the design of a NOW, we quantify the sen- 
sitivity to failures of two routing algorithms: up*/down* and minimal 
adaptive routing algorithms. Simulation results show that, in general, 
up*/down* routing is highly robust to failures. On the other hand, the 
minimal adaptive routing algorithm presents a better performance, even 
in the presence of failures, but at the expense of a larger sensitivity. 



1 Introduction 

Networks of workstations are currently being considered as a cost-effective alter- 
native for small-scale parallel computing. In order to achieve a high efficiency, 
the interconnects used in NOWs must provide high bandwidth and low laten- 
cies, usually making use of indirect networks, where communication between 
workstations is provided via several switches connected via one or more links. 
Recent proposals for NOW interconnects, such as Autonet [12], Myrinet [1], and 
ServerNet II [7], are designed in this way. 

Networks of workstations typically present an irregular topology as a conse- 
quence of the needs in a local area network. This irregular topology may span an 
entire building, or even several buildings. On the other hand, because NOWs use 
switch-based interconnects, messages must be routed through several switches
until they reach the destination. Routing algorithms consist of the rules that
messages must follow to go from a source to a destination workstation.

The design and evaluation of some routing algorithms in networks of workstations
with irregular topologies was previously carried out in [13]. In that study
the topology was assumed to be failure free, that is, the possibility that switches
and/or links could fail was not considered. In fact, switches and links have very
high values for the mean time between failures (MTBF) parameter. For example,
the commercial M2LM-SW16 Myrinet switch with 16 ports has a calculated
MTBF of approximately $5 \times 10^5$ hours (about fifty-seven years) [10]. However,
the probability of a switch failure increases linearly with the number of switches
in the network. Therefore, although the reliability of an individual
Myrinet switch is very high, if the network is large enough, the probability of failure is much
higher. For example, a 64-switch network built with the previous commercial
switch will have a global MTBF of $5 \times 10^5 / 64 \approx 7813$ hours (about one year).
On the other hand, in large networks spanning one or several buildings, faulty
switches are not the only possible source of failures. Switches may be unintentionally
turned off, or the power supply may fail. Also, it is possible that
some links may be incorrectly installed, may suffer accidental disconnections,
or may be affected by electromagnetic interference (EMI).
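The MTBF arithmetic above can be reproduced in a few lines. The sketch assumes independent switches with constant failure rates, so that the failure rate of the whole set is the sum of the individual rates and the global MTBF is the single-switch MTBF divided by the number of switches; the function name `system_mtbf` is illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf(component_mtbf_hours, count):
    """MTBF of `count` identical components, any of which failing counts."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return component_mtbf_hours / count


if __name__ == "__main__":
    switch_mtbf = 5e5  # hours, the value quoted for the 16-port switch
    for switches in (1, 8, 32, 64):
        hours = system_mtbf(switch_mtbf, switches)
        print(f"{switches:3d} switches: {hours:9.1f} h = {hours / HOURS_PER_YEAR:5.2f} years")
```

Link failures, power outages and accidental disconnections are not included in this estimate, so the figure is an upper bound on the time between disruptions seen in practice.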

Managing switch and/or link failures has been analyzed in the context of
direct networks with regular topologies [2,8]. In these networks a fault-tolerant
routing algorithm is needed in order to bypass the faulty region. These algorithms
are usually topology dependent. In the case of networks of workstations,
where generic routing algorithms must be used due to the topology irregularity,
failures may be handled in a more general way. In these networks, every
time a workstation is connected to or disconnected from the network, it
is necessary to run a reconfiguration process in order to update routing tables.
Also, every time a switch is attached to or detached from the network, routing tables
must be updated. Therefore, managing failures in irregular networks may
be seen as another instance of the general reconfiguration process. Nevertheless,
the performance of the reconfigured network should be analyzed.

When a link or a switch fails, network topology changes and thus a reconfigu- 
ration process starts in order to compute the routing rules for the new topology. 
During this process, a reconfiguration algorithm is used in order to make the 
network operational as soon as possible [12,1,3]. Because messages will be dis- 
carded during the reconfiguration phase, it is important that the entire process 
is as fast as possible. However, after the reconfiguration phase, the network will 
have a lower aggregate bandwidth than before the failure. Also, connectivity 
between workstations will be degraded. 

In this paper we will focus on analyzing the impact on network performance 
when functioning in a degraded mode once the network has been reconfigured 
after a failure. A study about how much network performance is degraded would 
be useful to know how user applications see the failure impact and how much they 
are affected by switch or link failures. Because a faulty link or switch modifies 
network topology, it is important to know how much partial modifications in the
topology will affect the performance of the routing algorithm used by messages.
To this end, we have evaluated the sensitivity to failures of two different routing
algorithms: up*/down* routing [12], and minimal adaptive routing [13].

The rest of the paper is organized as follows. Section 2 briefly introduces 
networks of workstations. In Section 3 the routing algorithms evaluated are 
described. Section 4 carries out a performance evaluation. Finally, Section 5 
summarizes the conclusions from this paper. 

2 Networks of Workstations 

Networks of workstations are usually arranged as switch-based networks with 
irregular topology. In these networks each switch is shared by several worksta- 
tions, which are connected to the switch through some of its ports. The remain- 
ing ports of the switch are either left open or connected to other switches to 
provide connectivity between the workstations. Links in a NOW are typically 
bidirectional full-duplex, and multiple links between two switches are allowed. 
Figure 1(a) shows a typical network of workstations. It is assumed in this figure 
that switches have eight ports and each workstation has a single port. 



Fig. 1. A NOW with irregular topology and its corresponding graph 

A routing algorithm must determine the path to be followed by messages. 
Several deadlock-free routing schemes have been proposed for irregular net- 
works [12,11]. Moreover, a general methodology for the design of adaptive routing 
algorithms for irregular networks has been recently proposed in [13]. 

Finally, once a message reaches a switch directly connected to its destination 
workstation, it can be delivered as soon as the corresponding link becomes free. 
Thus, we are going to focus on routing messages between switches, modeling the 
interconnection network I by a multigraph I = G(N,C), where N is the set of
switches, and C is the set of bidirectional links between switches. Figure 1(b) 
shows the graph for the irregular network in Figure 1(a). 




Fig. 2. NOW without link or switch failures

Fig. 3. NOW after the first link (4-1) failure

Fig. 4. NOW after the second link (3-4) failure



3 Routing Algorithms 

In this section, the two routing algorithms analyzed in the performance evalua- 
tion are briefly described. For more details, see [12] and [13], respectively. 

3.1 Up*/down* Routing 

The up*/down* routing scheme provides partially adaptive routing in irregular
networks. It is based on table lookup, and thus routing tables must be filled
before messages can be routed. In order to fill these tables, a breadth-first spanning
tree on the graph G of the network is first computed. The root switch of
the tree is the switch whose distance to the rest of the switches in the network is
minimum. Routing is based on an assignment of direction to all the operational
links in the network. In particular, the “up” end of each link is defined as: (1)
the end whose switch is closer to the root in the spanning tree; or (2) the end
whose switch has the lower identifier, if both ends are at switches at the same
tree level. After this assignment, each cycle in the network has at least one link
in the “up” direction and one link in the “down” direction. To avoid deadlocks,
routing is based on the following up*/down* rule: a message cannot traverse a
link in the “up” direction after having traversed a link in the “down” direction.

Although up*/down* routing provides some adaptivity, it is not always able 
to provide a minimal path between every pair of workstations due to the re- 
striction imposed by the up*/down* rule. As network size increases, this effect 
becomes more important. 
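The direction assignment and the up*/down* rule described above can be sketched as follows. This is an illustrative sketch and not the implementation evaluated in the paper: the root is passed in explicitly rather than selected automatically, the six-switch ring used in the example is hypothetical, and the names `bfs_levels`, `is_up` and `updown_distance` are choices made here.

```python
from collections import deque


def bfs_levels(adj, root):
    """Breadth-first distance of every switch from the given root."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def is_up(u, v, level):
    """True if going from u to v traverses the link in its "up" direction."""
    # The "up" end is the switch closer to the root, or the lower id on a tie.
    return (level[v], v) < (level[u], u)


def updown_distance(adj, level, src, dst):
    """Length of the shortest path that never goes up after going down."""
    if src == dst:
        return 0
    seen = {(src, False)}
    queue = deque([(src, False, 0)])
    while queue:
        u, went_down, hops = queue.popleft()
        for v in adj[u]:
            up = is_up(u, v, level)
            if up and went_down:
                continue  # forbidden: "up" after "down"
            state = (v, went_down or not up)
            if state in seen:
                continue
            if v == dst:
                return hops + 1
            seen.add(state)
            queue.append((v, state[1], hops + 1))
    return None


if __name__ == "__main__":
    # Hypothetical ring of six switches: 0-1-3-5-4-2-0, rooted at switch 0.
    ring = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1, 5], 4: [2, 5], 5: [3, 4]}
    levels = bfs_levels(ring, 0)
    print("up*/down* hops 3 -> 4:", updown_distance(ring, levels, 3, 4))
    print("minimal hops   3 -> 4:", bfs_levels(ring, 3)[4])
```

In this ring the two-hop route from switch 3 to switch 4 through switch 5 goes down and then up, so it is forbidden and the legal route passes through the root instead, which illustrates why up*/down* routing cannot always supply a minimal path.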

Figure 2 shows the example irregular network depicted in Figure 1. The root 
for the corresponding breadth-first spanning tree is switch 0. The assignment of 
“up” direction to the links in the network is illustrated. The “down” direction is 
along the reverse direction of the link. Figure 3 shows the reconfigured network 
after the failure of the link connecting switches 4 and 1. In this configuration, 
the root of the tree moves from switch 0 to switch 1. Figure 4 shows a new 
reconfigured network after the failure of a second link. In this case, the faulty
link was the one connecting switches 3 and 4. Note that the root of the tree has
changed again. In this case it has moved to switch 5. It must be pointed out
that, even if the root does not move to another switch after a failure, routing
tables should be computed again.

Table 1. Average number and length of paths between workstations for the
three irregular topologies employed in the performance evaluation

                                           Average number of paths    Average length of paths
Network size           Failures            up*/down*    minimal       up*/down*    minimal
Small (8 switches)     No failures         1.13         1.31          2.19         2.00
                       1 faulty link       1.65         1.71          2.41         2.35
                       2 faulty links      1.28         1.39          2.50         2.44
                       1 faulty switch     1.50         1.58          2.25         2.17
                       2 faulty switches   1.63         1.88          2.25         2.25
Medium (32 switches)   No failures         1.24         1.48          4.61         3.77
                       1 faulty link       1.21         1.45          4.64         3.84
                       2 faulty links      1.34         1.55          4.84         4.24
                       1 faulty switch     1.38         1.51          4.68         4.01
                       2 faulty switches   1.41         1.38          4.77         4.02
Large (64 switches)    No failures         1.19         1.40          5.86         4.55
                       1 faulty link       1.19         1.39          5.88         4.59
                       2 faulty links      1.19         1.40          5.91         4.64
                       3 faulty links      1.18         1.40          5.95         4.66
                       1 faulty switch     1.17         1.40          5.97         4.65
                       2 faulty switches   1.26         1.41          6.05         4.74

Table 1 shows the effect of the number of faulty links or switches
on the number of paths between workstations and their lengths when up*/down*
routing is used in the topology of Figure 1. It can be seen that the first link
failure increases the average number of up*/down* paths from 1.13 to 1.65. This
increase may seem counterintuitive, since
one would expect a failure to reduce the number of paths. The reason is
that the mean number of paths has increased at the cost of a greater
average path length. As can be seen in the table, after the first link failure,
the average path length increases from 2.19 to 2.41. With two faulty links,
the average number of up*/down* paths is 1.28 (up from 1.13 with no failures) and
the average path length is 2.50 (up from 2.19). A similar effect occurs when
switch failures are considered.

When network size increases, one can see that the effect of failures is similar: 
the mean number of paths changes (increases or decreases depending on each 
particular topology) but the average length of the paths always increases.

3.2 Adaptive Routing

A design methodology for adaptive routing algorithms for irregular networks
has been proposed in [13]. The aim of this methodology is to provide minimal
routing between every pair of workstations, as well as to increase adaptivity.
This design methodology can be applied to any deadlock-free routing algorithm. 
In particular, the second routing scheme evaluated in this paper is the result 
of applying this methodology to the up*/down* routing algorithm. Concisely, 
physical channels in the network are split into two virtual channels, called the 
original channel and the new channel. Newly injected messages can only leave the 
source switch using new channels belonging to minimal paths. When a message 
arrives at a switch through a new channel, the routing function gives a higher 
priority to the new channels belonging to minimal paths. If all of them are busy, 
then the up*/down* routing algorithm is used, selecting an original channel 
belonging to a minimal path (if any). If none of the original channels supplied 
provides minimal routing, then the one that provides the shortest path will 
be used. Once a message reserves an original channel, it will be routed using 
only original channels according to the up*/down* routing function until it is 
delivered. 
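The channel selection priorities just described can be sketched as a small function. This is only an illustration of the ordering of choices: the `Channel` record, its field names and the `arrived_via` argument are hypothetical, and the sketch assumes that a message simply waits (the function returns `None`) when no permitted channel is free, a detail the description above does not specify.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    name: str
    is_new: bool        # True for a "new" virtual channel, False for "original"
    minimal: bool       # the channel lies on a minimal path to the destination
    busy: bool
    updown_legal: bool  # offered by the up*/down* routing function
    hops_to_dest: int   # path length to the destination through this channel


def select_channel(channels, arrived_via):
    """Pick an output channel; arrived_via is "injected", "new" or "original"."""
    free = [c for c in channels if not c.busy]
    if arrived_via in ("injected", "new"):
        # First choice: a free new channel on a minimal path.
        for c in free:
            if c.is_new and c.minimal:
                return c
        if arrived_via == "injected":
            return None  # newly injected messages may only use new channels
    # Fall back to up*/down* routing on original channels.
    originals = [c for c in free if not c.is_new and c.updown_legal]
    minimal = [c for c in originals if c.minimal]
    if minimal:
        return minimal[0]
    return min(originals, key=lambda c: c.hops_to_dest, default=None)


if __name__ == "__main__":
    outputs = [
        Channel("new0", True, True, True, False, 2),
        Channel("orig0", False, False, False, True, 4),
        Channel("orig1", False, True, False, True, 2),
    ]
    print(select_channel(outputs, "new").name)
```

Once the function has returned an original channel for a message, the caller is expected to pass `arrived_via="original"` at every later switch, so that the message never returns to the new channels.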

Table 1 shows how link and switch failures affect the number of minimal
paths between workstations for this adaptive routing algorithm when used in
the network of Figure 1. Note that in this case the table reflects the minimal
topological distances, because when the minimal adaptive routing algorithm is
used, messages can follow non-minimal paths if they are routed along original
channels. It can be seen that the first link failure increases the average number
of minimal paths from 1.31 to 1.71, while their average length
increases from 2.00 to 2.35. With two faulty links, the average number
of paths is 1.39 (up from 1.31) and the average length is 2.44 (up from 2.00).
Results when switches fail are also displayed. It can be observed that the effect
of switch or link failures is similar to the case of the up*/down* routing scheme,
but the mean number of minimal paths remains higher than the mean number
of up*/down* paths, and their lengths are noticeably smaller, especially for
large networks.

In general, the impact of failures on topology is more significant as network 
size decreases. For example, regarding average path length, which is directly re- 
lated to message latency, the absolute increase is lower for the large network. 
Specifically, for 2 faulty links, the absolute increment when using up*/down*
routing is 0.31, 0.23, and 0.05 for small, medium, and large networks, respectively.
In the case of minimal adaptive routing, these increments are 0.44, 0.47,
and 0.09, respectively. If we look at the relative degradation, the increase is even
less dramatic.

4 Performance Evaluation 

This section evaluates the impact of link and switch failures on network performance
for the two routing algorithms previously presented. Before the analysis
of simulation results, we will describe the models used for switches and networks,
the traffic pattern generation, and the output variables used in the performance
evaluation.



4.1 Switch Model 

Each switch has a routing control unit that selects the output channel for a 
message as a function of its destination workstation, the input channel, and 
the output channel status. Table look-up routing is used. The routing control 
unit can only process one message header at a time. It is assigned to waiting 
messages in a demand-slotted round-robin fashion. When a message gets the 
routing control unit but it cannot be routed because all the alternative output 
channels are busy, it must wait in the input buffer until its next turn. A crossbar 
inside the switch allows simultaneous multiple message traversal. It is configured 
by the routing control unit each time a successful route is established. We have 
assumed that it takes one clock cycle to compute the routing algorithm. Also, 
it takes one clock cycle to transmit one flit across the internal crossbar. On the 
other hand, data are injected into the physical channel, which is pipelined, at a 
maximum rate of one flit per cycle. Link propagation delay has been assumed 
to be 4 clock cycles. 

Input buffer size is 32 flits. Since the minimal adaptive routing algorithm pre- 
sented before makes use of two virtual channels, the number of virtual channels 
used by the up*/down* routing scheme has been set to two in order to compare 
both routing algorithms in the same conditions. 



4.2 Network Model 

Network topology is completely irregular and has been generated randomly. However,
for the sake of simplicity, we imposed three restrictions on the topologies
that can be generated. First, we assumed that there are exactly 4 workstations
connected to each switch. Also, two neighboring switches are connected by a sin- 
gle link. Finally, all the switches in the network have the same size. We assumed 
8-port switches: we will suppose that one port remains free, thus leaving 3 ports 
available to connect to other switches. The location of link and switch failures 
is chosen in a random way. The only restriction is that the resulting topology 
after the failures must be connected. 
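
A minimal sketch of how such topologies and failures could be generated is given below. It is not the generator used for the experiments: the function names are hypothetical, the construction (a random spanning tree followed by random extra links) is an assumption, and it fills ports greedily, so it does not necessarily reproduce the exact link counts reported later. It only enforces the stated restrictions: at most 3 inter-switch ports per switch, a single link between neighboring switches, and connectivity after a failure.

```python
import random


def is_connected(num_switches, links):
    """Depth-first search over the switch graph."""
    adjacent = {s: set() for s in range(num_switches)}
    for a, b in links:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for nb in adjacent[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == num_switches


def random_topology(num_switches, ports_for_switches=3, seed=None):
    rng = random.Random(seed)
    degree = [0] * num_switches
    links = set()

    def connect(a, b):
        links.add((min(a, b), max(a, b)))
        degree[a] += 1
        degree[b] += 1

    order = list(range(num_switches))
    rng.shuffle(order)
    # Random spanning tree first, so the network is connected.
    for k in range(1, num_switches):
        parents = [s for s in order[:k] if degree[s] < ports_for_switches]
        connect(order[k], rng.choice(parents))
    # Then use up remaining ports, never doubling a link between two switches.
    pairs = [(a, b) for a in range(num_switches)
             for b in range(a + 1, num_switches)]
    rng.shuffle(pairs)
    for a, b in pairs:
        if (a, b) not in links and max(degree[a], degree[b]) < ports_for_switches:
            connect(a, b)
    return links


def fail_random_link(num_switches, links, rng):
    """Remove one random link whose loss keeps the network connected."""
    candidates = sorted(links)
    rng.shuffle(candidates)
    for link in candidates:
        remaining = links - {link}
        if is_connected(num_switches, remaining):
            return remaining
    return None
```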

We have evaluated networks with a size of 8 (small), 32 (medium), and 64 
(large) switches (32, 128, and 256 workstations, respectively). In order to study 
how much the variability in network topology affects its performance, five differ- 
ent irregular topologies have been generated, in a completely random way, for 
each of the three sizes. For each of the topologies, we ran simulations evaluating
them under exactly the same workload. However, although we have studied five
different topologies for each network size, only results for the most representative
of them are presented.
Fig. 5. Message latency versus traffic for a small network



4.3 Message Generation and Performance Variables 

We assumed that the message generation rate is exponentially distributed and
that it is the same for all the workstations. Also, we have assumed that the message
destination is randomly chosen among all the workstations in the network. With
respect to message length, 16-flit and 32-flit messages have been considered.
Conclusions in both cases are very similar, so we will present results only for
the latter case.

The most important performance measures are latency and throughput. Message
latency is the time elapsed from when the message is introduced into the network
until the last flit is received at the destination workstation, and it is measured
in clock cycles. Traffic is the flit reception rate measured in flits per cycle.
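
The workload described in this subsection can be sketched as follows. This is an illustrative generator, not the one in the simulator: the function name and parameters are hypothetical, the exponential distribution is applied to the time between consecutive messages of a workstation, and excluding the source from the possible destinations is an assumption the text does not state.

```python
import random


def generate_messages(num_workstations, rate, horizon, length_flits=32, seed=None):
    """Return (time, source, destination, length) tuples sorted by time.

    `rate` is the mean number of messages per cycle and per workstation;
    `horizon` is the number of simulated cycles.
    """
    rng = random.Random(seed)
    messages = []
    for src in range(num_workstations):
        t = rng.expovariate(rate)
        while t < horizon:
            # Uniform destination among the other workstations (assumption).
            dst = rng.randrange(num_workstations - 1)
            if dst >= src:
                dst += 1
            messages.append((t, src, dst, length_flits))
            t += rng.expovariate(rate)
    messages.sort()
    return messages
```

With this representation, the latency of a message is the cycle at which its last flit reaches the destination minus the first field of its tuple.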

4.4 Simulation Results 

Here we present the analysis of results obtained by simulation. The NOW sim- 
ulator [9] has been implemented in the CSIM language [4]. We will refer to the 
up*/down* and minimal adaptive routing algorithms without failures as UD and 
MA, respectively. A suffix like nL will indicate n link failures, whereas mS will 
indicate m switch failures. 



Small Size Networks These networks consist of 8 switches (32 workstations)
and 12 links. The first faulty link represents about 8.5% of the
total number of links, and the first faulty switch represents exactly 12.5% of
the total number of switches in the network. A second failure in a link or in a
switch will affect about 9% and 14% of the remaining links and switches,
respectively. Thus, failures in links or switches strongly affect the network
topology, due to its small size.
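
As a quick check of these proportions, the share that one additional failure represents can be computed directly. The helper name is hypothetical, and taking the second failure relative to the surviving components is an interpretation (it matches the quoted 9% and 14%).

```python
def failure_share(total, already_failed=0):
    """Percentage of the surviving components that one more failure represents."""
    return 100.0 / (total - already_failed)


print(round(failure_share(12), 2))     # first link of 12:    8.33
print(round(failure_share(8), 2))      # first switch of 8:   12.5
print(round(failure_share(12, 1), 2))  # second link:         9.09
print(round(failure_share(8, 1), 2))   # second switch:       14.29
```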

As can be seen in Figure 5, the UD routing algorithm is not greatly affected
by the first faulty link or switch, whereas the second one decreases throughput
by approximately 30%. On the other hand, the MA routing algorithm is highly
Fig. 6. Message latency versus traffic for a medium network



affected by the first failure, either in a link or in a switch, which increases
network latency and decreases the delivered traffic. With a faulty link or switch,
the maximum throughput of the network decreases by 25%. The second failure
significantly affects network performance for both routing algorithms. In particular,
the second faulty link affects performance as much as the second failure in a switch.
This is due to the small size of the network: the paths that connect workstations
are quite similar for both routing algorithms in the absence of failures,
both in the mean number of paths and in their average length. When failures
occur, the adaptivity of the MA routing algorithm decreases drastically, while
the average length of the paths increases to values similar to those achieved
by the UD scheme. As a result, the performance of the MA algorithm is
more affected by failures than that of the UD scheme. In all the analyzed
topologies, the MA-1L routing scheme always performs slightly better than the UD
routing algorithm. When a switch failure occurs, the MA-1S routing scheme and the
UD routing algorithm become similar. In any case, the negative effect of failures
increases with high loads, and for small size networks it strongly depends on the
underlying topology.



Medium Size Networks These networks consist of 32 switches (128 workstations)
and 43 links. Thus, a faulty link represents about 2.5% of the total
number of links, and a faulty switch represents exactly 3.125% of the total number
of switches; therefore, a failure in a link or in a switch does not affect the topology
as much as in the small size networks. The influence of a second failure in a
link or in a switch is also less important in these networks because there are
several paths connecting the workstations.

In Figure 6 it can be appreciated that the MA routing algorithm is more 
sensitive to failures than the UD routing scheme: while the latter only presents 
a slight increment in network latency, the former highly increases latency and 
decreases maximum throughput. The reason for the MA routing scheme to be 
more sensitive to failures than the UD algorithm is that as links and switches 
Fig. 7. Effect of link failures for another medium size network



fail, minimal paths increase their lengths, as can be seen in Table 1, and therefore 
the performance of the MA algorithm becomes much worse. This effect is not so 
noticeable in the UD scheme because paths are usually non-minimal. 

For this network size, two different performance zones clearly appear: one for 
the UD routing algorithm and another one for the MA routing algorithm. The 
MA routing algorithm always offers, even with faulty links or switches, better 
performance than the UD routing algorithm. This is because the network size 
allows the MA routing algorithm to take advantage of its degree of adaptivity, 
even when the number of minimal paths is reduced by failures. As can be ex- 
pected, the first switch failure has greater impact on network performance than 
the first link failure, although this effect is more important in the MA routing 
algorithm. 

Finally, as mentioned before, once the network has been reconfigured after
the failure of links or switches, the new routing tables have nothing to do with
the previous ones. In fact, they may distribute message traffic better. This is
the case for one of the studied topologies, where the UD routing algorithm
benefits from faulty links, as can be seen in Figure 7. In this case,
after the first link failure, both latency and throughput are improved because
the resultant topology balances traffic better. The MA routing algorithm also
experiences this improvement near network saturation. In this case, for low and
medium network loads, there is a slight increment in latency, because minimal
paths are longer due to the faulty link. However, for high network loads the
performance obtained with one faulty link is better because messages make
more intensive use of escape paths, which in this case offer better performance.



Large Size Networks Large networks consist of 64 switches (256 workstations) 
and 96 links. In this case, topology is not as much affected by failures as in the 
case for networks of small and medium sizes. Due to the high number of links in 
the network, we have also considered the failure of a third link. 
Fig. 8. Message latency versus traffic for a large network (axis label: delivered traffic, flits/cycle)



Figure 8 shows that the MA routing algorithm is more sensitive to failures 
than the UD routing algorithm, although not as much as in medium size net- 
works, because in these large networks a link or a switch represents a lower 
percentage of the whole network. The UD routing scheme only increases mes- 
sage latency, whereas the MA routing algorithm both increases latency and re- 
duces the maximum delivered traffic. In both routing algorithms, the sensitivity 
to switch failures is higher than to link failures because a greater number of 
minimal paths are lost. In general, the first link failure slightly affects network 
performance, because of network size: it still provides enough alternative paths 
to absorb the network traffic. The worst case occurs when there are two faulty
switches.

5 Conclusions 

In this paper, the sensitivity to failures in links and switches of two routing 
algorithms for NOWs has been analyzed. The evaluation study was performed 
on networks with 8, 32, and 64 switches, in order to consider from small size to 
large size networks. Network performance degrades in the presence of failures, 
but this degradation highly depends on the routing algorithm and the network 
size. 

In general, the performance of the up*/down* routing algorithm is slightly 
decreased by failures, only resulting in a low increment in latency. The minimal 
adaptive routing algorithm is more sensitive to failures, leading to an increment 
in latency, and a decrement of the maximum traffic accepted by the network. 
More precisely, and taking into account the size of the network, this study has 
provided the following main insights: 

— For small networks, both routing algorithms are affected in a similar manner, 
mainly due to the small number of paths between switches. A large performance
decrease is caused by the first failure in a link or in a switch, with
a degree of degradation quite similar for both kinds of failures. The minimal
adaptive routing algorithm performs better than the up*/down* routing 
scheme, even under the presence of a faulty link. In any case, the underlying 
topology highly affects the conclusions on network performance. 

— When network size increases, the effect of failures on network performance 
decreases, because there are more links, and thus more paths exist between 
switches. Also, faulty links and switches represent a small percentage of 
the network. The existence of a greater number of paths can be used by 
the minimal adaptive routing algorithm to take advantage of its adaptiv- 
ity. Independently from network topology, up*/down* routing performance 
is mainly limited by the early saturation of the links near the root switch. 
Moreover, even in the case of failures, the upper bound of network performance
remains limited by the saturation of these links. On the other hand,
the minimal adaptive routing algorithm is highly sensitive to failures because
the number of minimal paths is reduced, whereas their length is increased.

— In some of the analyzed topologies, the up*/down* routing algorithm has 
benefited from a failure. The minimal adaptive routing algorithm can also
benefit because it uses the up*/down* routing algorithm for the escape
paths. Thus, the design of the topology is a very important issue when the 
up*/down* routing algorithm is used. 
Abstract. In this paper, we study the multi-node broadcast problem in
hypercubes, which is an asynchronous and repetitive version of the all-to-all
broadcast problem. Suppose that nodes of a hypercube asynchronously
broadcast a piece of information. They can asynchronously initiate their
broadcasts while other broadcasts are in progress. The multi-node broadcast
problem is the problem of completing each of those broadcasts
quickly. We propose several decentralized schemes for solving the problem.
The effectiveness of the schemes is demonstrated by simulation.

Keywords: Parallel processing, broadcast, load balancing, hypercube, 
decentralized communication scheme. 



1 Introduction 

Quick completion of broadcasts significantly improves the overall performance of
many important applications, such as numerical computation and combinatorial
optimization. This is a main motivation for the study of the broadcast problem, which
has been investigated extensively during the past three decades [2,3]. Unfortunately,
however, most previous studies on efficient broadcast schemes have
been carried out under “strong” assumptions on the timing of broadcast initiation
to reduce the difficulty of the analysis, although in most real applications
broadcasts are initiated asynchronously and repeatedly.

We take on this problem in its general form. Suppose that nodes asynchronously
and repeatedly initiate their broadcasts to distribute a piece of information,
called a token, to every other node. The multi-node broadcast problem
(MBP, for short) is the problem of designing an algorithm for quickly completing
each of those broadcasts. In this paper, we consider MBP in hypercubes. Note
that the hypercube is one of the most popular network topologies for parallel computers [5],
and the design and analysis of efficient routing algorithms on hypercubes
is one of the most attractive issues to be studied [4].

Intuitively speaking, a multi-node broadcast scheme determines the route
of each token, in a static or dynamic manner. A key point in designing efficient
multi-node broadcast schemes is how to balance the load of communication links
that may vary dynamically depending on the timing of several broadcast initiations.
On the other hand, it is also true that the collection of global information
about the network traffic will increase the overall cost of the resultant scheme.
Hence, in order to complete each broadcast as quickly as possible, we should
develop a multi-node broadcast scheme that can adapt itself to changes in
traffic load by using information that is “locally” observable by each node.

The remainder of this paper is organized as follows. Section 2 introduces 
some basic definitions and notation. A simple single-node broadcast scheme on 
hypercubes is also given. In Section 3, several multi-node broadcast schemes 
are proposed. The effectiveness of the schemes is demonstrated by simulation in 
Section 4. Section 5 concludes the paper. 

2 Preliminaries 

2.1 Model 

Let $Q_n = (V, E)$ be an undirected binary $n$-cube, where $V$ is the set of nodes
representing processors, and $E$ is the set of edges representing bidirectional communication
links between processors. Each node in $V$ corresponds to a binary
string of length $n$, i.e., $V = \{0, 1\}^n$, and two nodes $u, v \in V$ are connected by
an edge in $E$ iff $u$ and $v$ differ in exactly one bit. If $u$ and $v$ differ in the $i$th bit,
then edge $\{u, v\} \in E$ is said to be of dimension $i$, and we write $v = \oplus_i u$.

Nodes in V communicate with each other by passing tokens through the communication
network. When a node initiates a (multi-node) broadcast scheme,
it computes information called routing information, which specifies the destinations
and the routes along which the token will flow, and attaches it to the token. Then the token
is sent out to the first node of each route specified in the routing information.
Upon receiving a token, a node looks up its routing information, and forwards
the token to the successor on each path specified in the routing information.

Each edge is assumed to be full-duplex, and conflicts of tokens at each edge 
are resolved by a conflict resolution strategy (CRS, for short) specified as a part 
of multi-node broadcast scheme. We further assume that, for simplicity, the local 
computation time at nodes is negligible. Nodes, therefore, can send at most one 
token to each of their adjacent nodes, and simultaneously, can receive at most 
one token from each of them in a unit time, called step. Note that tokens sent 
out to different adjacent nodes in a step can be different. 
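
For readers who prefer code, the node and edge model can be sketched with integers as bit strings. The helper names are hypothetical, and numbering dimension 1 as the least significant bit is an assumption made only for this illustration.

```python
def flip(u, i):
    """Neighbor of node u across dimension i (1-based; bit 1 = least significant)."""
    return u ^ (1 << (i - 1))


def dimension(u, v):
    """Dimension of edge {u, v}, or None if u and v are not adjacent."""
    diff = u ^ v
    if diff == 0 or diff & (diff - 1):  # zero or more than one differing bit
        return None
    return diff.bit_length()


def neighbors(u, n):
    """All n neighbors of u in the binary n-cube, ordered by dimension."""
    return [flip(u, i) for i in range(1, n + 1)]
```

In one step a node may send a (possibly different) token to each element of `neighbors(u, n)` and receive one from each of them.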

Finally, the broadcast time of a token is defined to be the maximum elapsed 
time necessary to distribute the token to every other node. 

2.2 Single-Node Broadcast Scheme SIMPLE 

In multi-node broadcasting, the broadcast time of each token depends on the
broadcast processes of other tokens. In [1], we have shown that if $K$ ($\le |V|$) nodes
simultaneously initiate their broadcasts, it takes at least $\max\{n, \lceil (K-1)/n \rceil\}$
steps to complete all of the $K$ broadcasts in $Q_n$ (hereafter we call it Remark
1). In this subsection, we introduce a single-node broadcast scheme that will
be used as a building block in later sections. In the scheme, called SIMPLE,
the routing information for a token is given in the form of a permutation over
$\{1, 2, \ldots, n\}$. A permutation $\pi$ over $\{1, 2, \ldots, n\}$ is described as $(i_1, i_2, \ldots, i_n)$,
where $i_j \in \{1, 2, \ldots, n\}$ for $1 \le j \le n$, and $i_j \ne i_k$ if $j \ne k$. For $1 \le j \le n$,
we denote $i_j$ by $\pi(j)$. Let $S_n$ be the set of all permutations over $\{1, 2, \ldots, n\}$.
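
The bound of Remark 1 is easy to evaluate numerically. The sketch below assumes the bound reads max{n, ceil((K-1)/n)} (the rounding direction is inferred from the argument that every initiator must still receive K-1 tokens over its n incoming links); the function name is hypothetical.

```python
def remark1_lower_bound(n, k):
    """max{n, ceil((k - 1) / n)} steps for k simultaneous broadcasts in an n-cube."""
    ceil_term = -(-(k - 1) // n)  # integer ceiling of (k - 1) / n
    return max(n, ceil_term)


print(remark1_lower_bound(9, 512))  # all nodes of a 9-cube broadcast: 57
print(remark1_lower_bound(9, 1))    # a single broadcast: 9
```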

In scheme SIMPLE, each broadcast initiator $u$ first selects a permutation $\pi$
from $S_n$ and attaches it to the token as the routing information. Then the token
is broadcast according to the following two rules.

scheme SIMPLE 

Rule 1: Broadcast initiator $u$ sends a copy of the token to every neighbor
$\oplus_{\pi(1)} u, \oplus_{\pi(2)} u, \ldots, \oplus_{\pi(n)} u$ simultaneously, and terminates.

Rule 2: If a node $v$ receives a token from node $\oplus_{\pi(i)} v$ for $1 \le i \le n$, then $v$
sends a copy of the token to each of neighbors $\oplus_{\pi(i+1)} v, \oplus_{\pi(i+2)} v, \ldots, \oplus_{\pi(n)} v$
simultaneously, and terminates. □

For any given permutation $\pi$ ($\in S_n$), scheme SIMPLE completes a single
broadcast in $n$ steps, which is time optimal [1]. In scheme SIMPLE, for any
$1 \le i < j \le n$, no edge of dimension $\pi(j)$ is followed by an edge of dimension
$\pi(i)$ on any delivery path. Since each delivery path contains at most one edge of
dimension $i$ for each $1 \le i \le n$, permutation $\pi$ completely determines the order
of edges that occur in each delivery path.
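
The two rules can be checked with a small contention-free simulation. This is a sketch under assumptions: the function name is hypothetical, nodes are integers, and dimension 1 is taken to be the least significant bit.

```python
def simple_broadcast(n, source, perm):
    """Run scheme SIMPLE for one token without contention.

    `perm` lists the dimensions pi(1), ..., pi(n). Returns a dict mapping each
    node to the step at which it receives the token (0 for the initiator).
    """
    arrival = {source: 0}
    # Each frontier entry is (node, i): the node received the token over
    # dimension pi(i); i = 0 marks the initiator (Rule 1).
    frontier = [(source, 0)]
    step = 0
    while frontier:
        step += 1
        next_frontier = []
        for node, i in frontier:
            # Rule 2: forward along pi(i+1), ..., pi(n); perm is 0-indexed.
            for j in range(i, n):
                child = node ^ (1 << (perm[j] - 1))
                if child in arrival:
                    raise RuntimeError("node reached twice")
                arrival[child] = step
                next_frontier.append((child, j + 1))
        frontier = next_frontier
    return arrival


times = simple_broadcast(4, 0b1010, [3, 1, 4, 2])
print(len(times), max(times.values()))  # 16 nodes reached, last one at step 4
```

Every node is reached exactly once, at a step equal to its Hamming distance from the initiator, which is why the broadcast ends after n steps.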

2.3 Conflict Resolution Strategies 

The CRS plays an important role in designing efficient multi-node broadcast schemes.
Let $t(u)$ be a token waiting to pass through an edge $e$ of dimension $i$, and $\pi$ the
permutation attached to $t(u)$. The rank of token $t(u)$ at edge $e$ is the integer $j$
such that $i = \pi(j)$. The furthest-destination first-serve (FDFS) rule is a CRS
that selects a token with the smallest rank among all conflicting tokens, where
a tie is broken by the first-come first-serve (FCFS) rule. In order to minimize
the maximum broadcast time, in the following, we often introduce the notion of a
deadline to the FDFS rule. Let $r$ be a natural number representing the deadline.
In the deadlined FDFS, we modify the original FDFS in such a way that once a
token has spent $r$ time units in a waiting queue, it is given the
highest priority among the tokens in the queue until it is taken out of the waiting
queue.
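
A sketch of the selection performed at one edge is given below. The record and function names are hypothetical, and serving several overdue tokens in FCFS order is an assumption, since the text does not say how ties among them are broken.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Waiting(NamedTuple):
    token_id: str
    perm: Tuple[int, ...]  # permutation attached to the token
    entered_at: int        # step at which the token joined this queue


def rank(perm, dim):
    """The j such that dim = pi(j), with j counted from 1."""
    return perm.index(dim) + 1


def pick_token(queue: Sequence[Waiting], dim: int, now: int,
               deadline: Optional[int] = None) -> Waiting:
    """Choose the token that crosses an edge of dimension `dim` at step `now`."""
    if deadline is not None:
        overdue = [w for w in queue if now - w.entered_at >= deadline]
        if overdue:
            # Deadlined FDFS: an overdue token gets the highest priority.
            return min(overdue, key=lambda w: w.entered_at)
    # Plain FDFS: smallest rank first, FCFS among equal ranks.
    return min(queue, key=lambda w: (rank(w.perm, dim), w.entered_at))
```

A token with a small rank at an edge still has many dimensions left to traverse, which is the intuition behind the name of the rule.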

3 Load Balancing Schemes 

3.1 Naive Oblivious Schemes 

The simplest way for realizing a multi-node broadcast is to invoke SIMPLE for a 
fixed permutation tt regardless of the broadcast initiator, and to resolve conflicts 
by the FDFS rule. We call this naive scheme NAIVE. Although it achieves the 
Fig. 1. Initiation of a broadcast according to scheme SORT (figure labels: broadcast initiator; routing information = (4,5,1,2,7,3,6))



lower bound on the broadcast time in the best case, in the worst case its
broadcast time is a factor of $n/2$ above the lower bound [1]. A simple (but powerful) way of
balancing the traffic load in NAIVE is to introduce randomization. More concretely,
we may modify NAIVE in such a way that the attached permutation is
randomly and independently selected from set $S_n$ for each broadcast initiator.
In the following, we call this randomized scheme RANDOM.
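
The difference between the two oblivious schemes lies only in how the permutation is chosen, as the following sketch shows. The function names are hypothetical, and using the identity as the fixed permutation of NAIVE is an arbitrary choice made for the example.

```python
import random


def naive_permutation(n):
    """NAIVE: the same fixed permutation for every initiator (identity here)."""
    return tuple(range(1, n + 1))


def random_permutation(n, rng=random):
    """RANDOM: a permutation drawn uniformly from S_n for each broadcast."""
    return tuple(rng.sample(range(1, n + 1), n))
```

Either result is then attached to the token and the broadcast proceeds according to scheme SIMPLE, with conflicts resolved by the FDFS rule.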

3.2 An Adaptive Scheme Based on Local Information 

In [1], we proposed a non-oblivious multi-node broadcast scheme SORT. In the
scheme, the selection of a permutation is based on the length of the waiting
queues incident on the initiator. More precisely, the initiator generates a permutation
by “sorting” the dimensions in nonincreasing order of the length of
the outgoing waiting queues. Note that the queue length of an outgoing edge is
“locally” observable by the initiator, so the scheme assumes no global information about
the network traffic. A formal description of the scheme is as follows (Figure 1
illustrates a broadcast initiation according to SORT):

scheme SORT 

1) Preprocessing: Let $M = \{1, 2, \ldots, n\}$ be the set of indices. Let $\ell(i)$ denote
the queue length of the outgoing edge of dimension $i$ that is incident on
the initiator. We first sort set $M$ in nonincreasing order of $\ell(i)$. Let
$(i_1, i_2, \ldots, i_n)$ be the resulting list; i.e., $i_j \ne i_k$ for $j \ne k$, and
$\ell(i_j) \ge \ell(i_{j+1})$ for $1 \le j \le n-1$.

2) Broadcast tokens: Attach permutation $(i_1, i_2, \ldots, i_n)$ to the token as the
routing information, and initiate the broadcast according to scheme SIMPLE.

3) Conflict resolution strategy: Conflicts of tokens are resolved by the
deadlined FDFS rule. □
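
The preprocessing step of SORT amounts to one sort over the dimensions. In this sketch the function name and the dictionary representation of the queue lengths are assumptions, and ties are broken by dimension index only to make the result deterministic.

```python
def sort_permutation(queue_len):
    """SORT preprocessing at the initiator.

    `queue_len` maps each dimension i to the length of the outgoing waiting
    queue of that dimension. Returns the dimensions in nonincreasing order
    of queue length.
    """
    return tuple(sorted(queue_len, key=lambda d: (-queue_len[d], d)))


print(sort_permutation({1: 2, 2: 0, 3: 5, 4: 2}))  # (3, 1, 4, 2)
```

The most loaded dimension comes first, so the initiator's own copy over that edge is the only one this token ever places on that dimension near the initiator; only local queue lengths are consulted.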
Fig. 2. Forwarding of token in scheme RESORT1



3.3 Introduce More Adaptiveness 

In our preliminary experiments [1], it was shown that SORT is more robust than
RANDOM when there is a spatial imbalance of broadcast initiators. Although
this seems to be a natural consequence, since SORT is non-oblivious and can use
more information than RANDOM, from the viewpoint of “decentralized” load
balancing it is very interesting, since in SORT each initiator only collects “local”
information before determining the routing information.

However, it would be too optimistic to assume that the information local
to the broadcast initiator precisely reflects the global load distribution. Hence,
a natural idea for improving the performance of SORT is to allow adaptive
changes of the attached permutation according to the local information that can
be observed by the intermediate nodes on the routing path. In the following,
we propose two such extensions and evaluate their effectiveness by
simulation. The first is to change the attached permutation at every node
on the delivery path. We call it RESORT1 (Figure 2 illustrates the forwarding of a
token in scheme RESORT1). The second is to allow an adaptive change of the
permutation at most once on any delivery path. We call the resulting scheme
RESORT2. Formal descriptions of the two schemes are given as follows.

scheme RESORT1

The scheme is obtained from scheme SORT by modifying Rule 2 of the underlying
single-node broadcast scheme SIMPLE, as follows:

Step 1: Suppose that a node $v$ receives a token from node $\oplus_{\pi(i)} v$ for $1 \le i \le n$,
where $\pi$ is the permutation attached to the token as the routing information.

Step 2: Let $M = \{\pi(i+1), \pi(i+2), \ldots, \pi(n)\}$ be the set of indices that have
not been used in the delivery path from the initiator to $v$, and should be
used in the remaining paths. Sort set $M$ in nonincreasing order of $\ell(j)$,
where $\ell(j)$ is the queue length of the outgoing edge of dimension $j$ that is
incident on $v$. Let $(i'_{i+1}, i'_{i+2}, \ldots, i'_n)$ be the resulting list.

Step 3: We modify the routing information attached to the received token to
$\pi' = (\pi(1), \pi(2), \ldots, \pi(i), i'_{i+1}, i'_{i+2}, \ldots, i'_n)$.

Step 4: $v$ sends a copy of the token to each of neighbors $\oplus_{\pi'(i+1)} v, \oplus_{\pi'(i+2)} v,
\ldots, \oplus_{\pi'(n)} v$ simultaneously, and terminates. □
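
Steps 2 and 3 of RESORT1 can be sketched as a re-sort of the unused tail of the permutation. The function name and the dictionary of local queue lengths are assumptions for illustration, and ties are broken by dimension index for determinism.

```python
def resort1(perm, i, queue_len):
    """Re-sort the unused dimensions at an intermediate node v.

    `perm` is the attached permutation, `i` the 1-based position of the
    dimension over which the token arrived, and `queue_len` maps each
    dimension to the length of v's outgoing queue on that dimension.
    """
    head, tail = tuple(perm[:i]), perm[i:]
    new_tail = sorted(tail, key=lambda d: (-queue_len[d], d))
    return head + tuple(new_tail)


# The token arrived over pi(1) = 3; dimensions 1, 4, 2 are still unused.
print(resort1((3, 1, 4, 2), 1, {1: 0, 2: 4, 3: 9, 4: 1}))  # (3, 2, 4, 1)
```

Node v then forwards the token along the re-sorted tail, exactly as in Step 4.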

scheme RESORT2

This scheme is obtained from schemes SORT and RESORT1 as follows: If the
distance between $v$ and the initiator is $\lfloor n/2 \rfloor$, then $v$ acts as in scheme RESORT1;
otherwise, it acts as in scheme SORT. □
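
RESORT2 differs from RESORT1 only in where the re-sort happens, as the following sketch shows. The function name and arguments are hypothetical; the distance is computed as the Hamming distance between the two node labels, which equals the hop count because a delivery path uses each dimension at most once.

```python
def resort2(perm, i, v, initiator, queue_len):
    """Permutation used by node v when forwarding a token under RESORT2.

    `i` is the 1-based position of the arrival dimension in `perm`, and
    `queue_len` maps each dimension to the length of v's outgoing queue.
    """
    n = len(perm)
    distance = bin(v ^ initiator).count("1")
    if distance != n // 2:
        return tuple(perm)  # act as in SORT: keep the attached permutation
    tail = sorted(perm[i:], key=lambda d: (-queue_len[d], d))
    return tuple(perm[:i]) + tuple(tail)
```

For n = 4 and initiator 0, a node at distance 1 leaves the permutation untouched, whereas a node at distance 2 re-sorts the two remaining dimensions according to its own queues.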

4 Simulation 

We conducted simulations in order to evaluate the effectiveness of the proposed
schemes. All of the following results assume $n = 9$; i.e., a binary $n$-cube with $2^9 =
512$ nodes is considered. Every node repeats a broadcast 30 times, and each
node can initiate a new broadcast only after completing the previous broadcast
initiated by the node. The time interval between two consecutive broadcasts is
randomly selected from set $\{0, 1, \ldots, 5\}$; i.e., at least 0 steps and at most 5 steps.

We conducted experiments under the following three patterns of initiator
distribution: all of the $|V|$ nodes broadcast (Pattern 1); the $|V|/2$ nodes with
prefix 0 broadcast (Pattern 2); and the $|V|/2$ nodes with prefix 00 or 11 broadcast
(Pattern 3). In the next subsection, we demonstrate that if the spatial imbalance
of broadcast initiators is large enough, then the proposed adaptive schemes can
give a better performance in terms of the average broadcast time than RANDOM.
The result also shows that the introduction of further adaptiveness to SORT
does not significantly improve the performance of the scheme.
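
The three initiator patterns can be enumerated as follows. The function name is hypothetical, and reading the "prefix" as the most significant bits of the node label is an assumption made for this illustration.

```python
def initiators(n, pattern):
    """Nodes that broadcast under Pattern 1, 2 or 3 in a binary n-cube."""
    nodes = range(2 ** n)
    if pattern == 1:
        return list(nodes)                                      # every node
    if pattern == 2:
        return [u for u in nodes if u >> (n - 1) == 0]          # prefix 0
    if pattern == 3:
        return [u for u in nodes if u >> (n - 2) in (0b00, 0b11)]  # prefix 00 or 11
    raise ValueError("pattern must be 1, 2 or 3")


print([len(initiators(9, p)) for p in (1, 2, 3)])  # [512, 256, 256]
```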

4.1 Results 

Figures 3 and 4 show average broadcast times for each initiation pattern. In each 
figure, broadcast time is averaged over all initiators and ten broadcasts; e.g., the 
value of the vertical axis at “10 iterations” is the average broadcast time over 
the first 10 broadcasts of all initiators (“20 iterations” corresponds to the next 
10 broadcasts and “30 iterations” corresponds to the last 10 broadcasts). 

From the figures, we make the following observations: In Pattern 1, RANDOM
and SORT exhibit almost the same performance, and the performance is degraded
by allowing more adaptiveness; e.g., RESORT1 takes 25 % more broadcast
time than SORT, and RESORT2 takes 5 % more broadcast time than SORT (see
Figure 3). In Pattern 2, the performance of RANDOM becomes clearly worse
than that of the other schemes; e.g., it takes 40 % more broadcast time than SORT. On
the other hand, the comparison of the three adaptive schemes implies that even in
such an “imbalanced” situation, RESORT1 and RESORT2 cannot beat SORT
(see Figure 4 (a)). In Pattern 3, RANDOM beats RESORT1, and the three schemes



Decentralized Load Balancing in Multi-node Broadcast Schemes 







Fig. 3. Uniform distribution of broadcast initiators (Pattern 1) 



except for RESORT1 exhibit almost the same performance. In fact, RESORT1
takes 20 % more broadcast time than the other three schemes (see Figure 4 (b)).

In order to find an explanation of the above phenomena, we observed the difference
between the permutations generated by the different schemes in more detail. Note that
in Pattern 2, the edges of dimension 9 become a bottleneck if that dimension is used as
one of the latter (or the last) elements of the generated permutation. Hence, in order to
avoid such a possible bottleneck, the scheme should use those edges as the former
elements. In fact, the percentage of generated permutations with dimension 9 as the
first element decreases from 38 % to 27 % when RESORT2 is used instead
of SORT, and to 21 % when RESORT1 is used. In other words, the force that moves
dimension 9 toward the former elements (in the generated permutation) becomes weaker
as more adaptiveness is introduced to the broadcast scheme.

On the other hand, in Pattern 3, we may avoid bottlenecks caused by two 
dimensions instead of one (note that dimensions 8 and 9 play the same role 
in Pattern 3); i.e., the spatial imbalance of broadcast initiators in Pattern 3 is 
smaller than that in Pattern 2. In fact, the percentage of permutations is almost 
equal, in contrast to Pattern 2. 



5 Concluding Remarks 

In this paper, we proposed several decentralized schemes for solving the multi-node
broadcast problem. The effectiveness of the schemes was demonstrated by
simulation. As a result, we showed that the multi-node broadcast scheme in which
each broadcast initiator adapts itself to the traffic load by using locally observable
information exhibits the best performance, and that the force to balance the
load becomes weaker as more adaptiveness is introduced to the scheme.
Abstract. The development of fine-grain multi-threaded program execution
models has created an interesting challenge: how to partition
a program into threads that can exploit machine parallelism, achieve 
latency tolerance, and maintain reasonable locality of reference? A suc- 
cessful algorithm must produce a thread partition that best utilizes mul- 
tiple execution units on a single processing node and handles long and 
unpredictable latencies. 

In this paper, we introduce a new thread partitioning algorithm that can 
meet the above challenge for a range of machine architecture models. A 
quantitative affinity heuristic is introduced to guide the placement of 
operations into threads. This heuristic addresses the trade-off between 
exploiting parallelism and preserving locality. The algorithm is surpris- 
ingly simple due to the use of a time-ordered event list to account for the 
multiple execution unit activities. We have implemented the proposed al- 
gorithm and our experiments, performed on a wide range of examples, 
have demonstrated its efficiency and effectiveness. 



1 Introduction 

This paper is a contribution to the development of high-performance computer 
systems based on a fine-grain multi-threaded program execution and architecture 
model. A key to the success of multi-threading is the development of compila- 
tion methods that can efficiently exploit fine-grain parallelism in application 
programs and match them with the parallelism of the underlying hardware ar- 
chitecture. In particular, partitioning programs into fine-grain threads is a new 
challenge that is not dealt with in conventional compiler code generation and 
optimization. 

Fig. 1. Architecture abstract model.

Our thread partitioning algorithm was developed for the Efficient Architecture
for Running THreads (EARTH), a multi-threaded execution and architecture
model [8,4]. Under the EARTH model, a thread becomes enabled for
execution if and only if it has received signals from all the split-phase operations
that it depends on. Furthermore, threads are non-preemptive: once a thread is
scheduled for execution, it holds the execution unit until its completion. Therefore,
whenever an operation may involve long and/or unpredictable latencies,
the role of the compiler (or programmer) is to make the operation “split-phase”.
We call this requirement the split-phase constraint, and assume that such con- 
straints are explicitly identified and presented to the thread partitioner. The 
thread partitioning problem studied in this paper can be stated as follows. 

Given a machine model M and a weighted data dependence graph G with
some nodes labeled as split-phase nodes, partition G into threads such 
that the total execution time of G is minimized subject to the split-phase 
constraints. 

The main contribution of this paper is the development of a heuristic thread
partitioning algorithm suitable for a machine model that allows multiple execution
units in each processing node.¹ Unlike previous related thread partitioning
algorithms, ours faces a new challenge: the existence of more than one thread 
execution unit per node implies a trade-off between the need to generate enough 
parallel threads per node to utilize these execution units, and the need to assign 
related operations to the same thread to enhance locality of access. 

2 Machine Model 

Our architecture model is presented in Figure 1. Each processing node has N 
execution units (EU) and one synchronization unit (SU). Both the EU and the 
SU perform the functions specified in the EARTH Virtual Machine [8]. Threads
that are ready for execution are placed in the ready queue that is serviced by the 
EUs. When an active thread performs a synchronization operation or requests a 
long latency data transfer, the request for such a service is placed in the dispatch 
queue. The SU is responsible for the communication with all other processing 
nodes and for the synchronization between threads within the node. 

¹ Note that the algorithm presented in [7] collapses as much local computation as
possible into a single thread, thus making it inadequate for the machine model 
studied in this paper that has multiple execution units per processing node. 
In the model of Figure 1 each processing node has its own memory hierarchy,
but an EU can access memory locations in any node of the machine.² An access
to a location in the local memory hierarchy, a local access, has a lower latency
and higher bandwidth than a remote access. As in [7], we assume that a cost
model is provided that allows for an estimation of the cost of all local and all
remote operations required in a program statement. We use S to represent the
cost associated with the termination of a thread and the start of the execution
of another thread. Although we assume that a ready thread can be executed by
any one of the local processors, the partitioning algorithm takes into consideration
the number of dependences among statements when placing them in different
threads, in order to favor data locality and benefit from the local caches in the
architecture.



3 Thread Partition Cost Model 



We assume that a program is written in a sequential language augmented with 
high-level parallel constructs, and that the data has been partitioned among
the memory modules in the processing nodes of the machine. Therefore, given a 
program statement we can determine whether the statement is local or remote. 
We also assume that the program has been translated into a Data Dependence 
Graph (DDG). 

Thus the program is represented by a graph $G(V, E)$, where each node in $V$
represents a collection of program statements. A node can be a simple node, such
as an assignment statement, or a compound node, such as a loop. If the execution
of the program statements requires accesses to a remote memory module, the
node is a remote node; otherwise the node is a local node. Each edge $(v_i, v_j)$ in
$E$ represents a data dependency from $v_i$ to $v_j$. An edge departing from a remote
node is a remote edge, and an edge departing from a local node is a local edge.
As in [7], we represent $E$ by an adjacency matrix $C$ defined as:

$$C_{ij} = \begin{cases} 1 & \text{if } (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases}$$



The partitioning algorithm is based on a cost model that associates a local
cost $c_i^l$ and a remote cost $c_i^r$ with each node $v_i$. These represent, respectively,
the number of cycles that the node spends in the EU and the number of cycles that
elapse between the request and the completion of a split-phase transaction started
by the node. One of the main goals of the partitioning algorithm is to decide when
it is advantageous to dispatch the request, terminate the current thread, and
start the execution of a new thread when the remote operation is completed.
For some interconnection networks a model to predict the network performance
will be necessary because the remote cost is affected by the load in the network.
For the experiments with our partitioning algorithm, we assume that this cost
is bounded by a constant. The cost model presented in this paper can be
constructed for any machine through profiling experiments and examination of the
machine specification for the number of cycles required to execute each class of
instruction. The network latency and bandwidth can also be readily measured.
For our experiments we use the values obtained by Theobald for the
EARTH-MANNA implementation [8].

² If the underlying machine does not allow direct accesses to remote memory, the
EARTH system emulates a global address space.



4 Problem Formulation 



Assume that at runtime threads are selected for execution from a single queue
of ready threads using an efficient scheduler. Given a DDG $G$ with each node
$v_i \in G$ annotated with its local cost $c_i^l$ and its remote cost $c_i^r$, and a constant
thread switching cost $S$, the thread partitioning problem is the problem
of finding a thread partition $P$ that meets two goals: (1) minimizes the total
execution time, and (2) maximizes the affinity between nodes assigned to the same
thread. The affinity of a node $v_i$ of $G$ to a thread $T_k$ of $P$, $A(v_i, T_k)$, is given by
the ratio between the number of dependences between nodes of $T_k$ and $v_i$, and
the total number of incoming edges of $v_i$.



$$A(v_i, T_k) = \frac{\sum_{v_j \in T_k} C_{ji}}{\sum_{v_j \in V} C_{ji}}$$



Observe that if all the incoming edges of $v_i$ are from nodes in $T_k$, then
$A(v_i, T_k) = 1$. On the other hand, if none of the incoming edges of $v_i$ are from
nodes in $T_k$, then $A(v_i, T_k) = 0$.
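As an illustration, the affinity function can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the DDG is given as the adjacency matrix C (a list of lists), a thread is a set of node indices, and the function name `affinity` and the convention of returning 0 for a node without incoming edges are ours, not part of the algorithm description.

```python
from typing import List, Set


def affinity(C: List[List[int]], i: int, thread: Set[int]) -> float:
    """Fraction of the incoming edges of node i that start in `thread`.

    C[j][i] == 1 means there is an edge (v_j, v_i) in the DDG.
    """
    n = len(C)
    incoming = sum(C[j][i] for j in range(n))
    if incoming == 0:
        # Assumption: a node with no predecessors has zero affinity.
        return 0.0
    from_thread = sum(C[j][i] for j in thread)
    return from_thread / incoming


# Small DDG with edges v0 -> v2, v1 -> v2, v2 -> v3.
C = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(affinity(C, 2, {0}))     # 0.5: one of the two incoming edges
print(affinity(C, 2, {0, 1}))  # 1.0: all incoming edges come from the thread
print(affinity(C, 3, {0, 1}))  # 0.0: no incoming edge comes from the thread
```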

Goals (1) and (2) should be pursued under the constraint that nodes con- 
nected by a remote edge are assigned to different threads. Goal (1) is the prin- 
cipal goal of the partition algorithm, while goal (2) is necessary to favor locality 
of access because the abstract model assumes that all processors have equal 
probability of fetching a thread from the ready queue. 

The thread partitioning algorithm uses the affinity function to decide in which 
thread to place a node of the DDG. The algorithm keeps a partial schedule of 
the threads already formed and searches into this schedule for the best place to 
insert a node from the DDG. To minimize the searching time, an event list is used 
to store the starting and finishing time of each thread. A detailed description of 
the thread partitioning algorithm including an example is presented in [1].



5 Experimental Results 

We use the Thread Partition Test Bed presented in [7] to generate random 
DDGs to test the partition algorithm. We vary several properties of the DDGs 
generated, including the number of nodes, the average number of outgoing edges 
from a node, the percentage of remote nodes in the graph and the distribution 
of local and remote costs in the nodes. 
The distribution of execution costs for the nodes is as follows. A local node
can be one of three types: (1) local I/O (20%), (2) local function call (10%), or
(3) other (math, etc.) (70%). A local I/O is assigned a cost of 10 cycles, a local
function call is assigned 10 cycles, and a node of any other type is assigned
3 cycles. A remote node can be one of two types: (1) remote I/O (80%) or
(2) remote function call (20%). A remote I/O is assumed to take 300 cycles, and
the costs of remote function calls are uniformly distributed between 400 cycles
and 4000 cycles. This distribution of node types and the degree of the DDGs
generated are based on static profiling of EARTH-C benchmarks [7, 3].
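The cost distribution above can be sampled as in the following Python sketch. It is not the Thread Partition Test Bed of [7]; the function name `random_node_cost`, the type labels, and the use of integer cycle counts for remote function calls are assumptions made for illustration.

```python
import random
from typing import Tuple


def random_node_cost(is_remote: bool, rng: random.Random) -> Tuple[str, int]:
    """Draw a (node type, cost in cycles) pair from the stated distribution."""
    r = rng.random()
    if is_remote:
        if r < 0.80:
            return ("remote_io", 300)
        # Assumption: uniform over integer cycle counts in [400, 4000].
        return ("remote_call", rng.randint(400, 4000))
    if r < 0.20:
        return ("local_io", 10)
    if r < 0.30:
        return ("local_call", 10)
    return ("other", 3)


rng = random.Random(0)  # fixed seed so that a run is reproducible
sample = [random_node_cost(is_remote=(k % 2 == 0), rng=rng) for k in range(6)]
print(sample)
```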

5.1 Summary of Main Experimental Results 

The main results of our experiments can be summarized as follows.

Absolute Efficiency Our new partition algorithm is very efficient for a wide
range of DDGs on all the machine models except the EARTH-MANNA-DUAL model,
where only a single EU is available per node and there is a high cost
associated with thread switching. The algorithm performs remarkably well when
the machine model has multiple execution units: for the SMP and SCMP
models, the average absolute efficiency is above 99%.

Effectiveness of Search Heuristics The use of an event list and of a time 
line schedule results in an effective search for the placement of a new node 
(see [1] for details). 

Latency Tolerance Capacity The algorithm is robust to variations in la- 
tency. In our experiments a remote operation latency varied between 400 and 
4000 cycles. As shown in Table 1 the partition algorithm produces thread 
partitions that are able to tolerate these varied latencies. 



5.2 Machine Architectures 

We define five different machine architectures for our experiments. MANNA is
a multiprocessor machine with 40 processors distributed in 20 processing nodes 
interconnected by a crossbar switch. MANNA is the first platform in which the 
EARTH model was implemented [5].

EARTH-MANNA-DUAL: An implementation of the EARTH architecture
on the MANNA machine. The second processor is used for the SU function.
Therefore, according to our model, this is a machine with a single EU
per node. The thread switching cost in this machine is $S = 36$ cycles (see
measurements reported in [8]).

EARTH-MANNA-SPN: The two processors in the machine are used to implement
the SU functions. Therefore, this machine has two EUs per processing
node. The thread switching cost is $S = 16$ (see measurements reported
in [8]).

EARTH-SU: This is the EARTH architecture with a custom hardware SU.
We consider a machine with a single execution unit per processing node and
with a thread switching cost $S = 2$ (see measurements reported in [8]).

SMP: This is a Symmetric Multi-Processor machine with 4 processors per
node. In such a machine we expect the thread switching cost to be similar
to the one for the EARTH-MANNA-SPN; therefore we use $S = 16$.

Single-Chip Multi-threaded Processor (SCMP): In this case we consider
a hypothetical machine that has multiple functional units with multi-threading
support. We assume a machine with 8 EUs and with a thread switching cost
of $S = 10$.



5.3 Measuring Absolute Efficiency 

An optimal partition cannot result in an execution time that is shorter than 
the critical path of the program or shorter than the total amount of work to be 
performed divided by the number of EUs available. Thus, a lower bound for the 
execution time is given by: 

$$T_{lowest} = \max\left(T_{cp}, \frac{\sum_{i=1}^{n} c_i^l}{N}\right)$$

where $T_{cp}$ is the length of the critical path, the sum of the local costs $c_i^l$ of all
$n$ nodes is the total amount of work to be performed by the program, and $N$ is the
number of EUs in the machine. We define the absolute efficiency as the ratio:

$$E = \frac{T_{lowest}}{T_{end}}$$

where $T_{end}$ is the execution time for the program with the thread partition
produced by our algorithm running under an efficient FIFO scheduler. $E = 100\%$
means that the partition algorithm found a partition that can result in the
optimal execution time.
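The lower bound and the absolute efficiency can be computed as in the following Python sketch. The function names are ours, and the sketch assumes that the DDG is acyclic, that it is given as a list of edges over nodes numbered from 0, and that the caller decides which per-node cost enters the critical path; the text does not fix these details.

```python
from collections import deque
from typing import List, Sequence, Tuple


def critical_path_length(node_cost: Sequence[float],
                         edges: List[Tuple[int, int]]) -> float:
    """Length of the longest path in a DAG whose nodes carry the given costs."""
    n = len(node_cost)
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    finish = list(node_cost)  # earliest finish time of each node
    ready = deque(i for i in range(n) if indeg[i] == 0)
    while ready:  # Kahn's topological traversal
        u = ready.popleft()
        for v in succ[u]:
            finish[v] = max(finish[v], finish[u] + node_cost[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(finish, default=0)


def lower_bound(t_cp: float, local_cost: Sequence[float], n_eus: int) -> float:
    """Larger of the critical path length and the total work divided by N."""
    return max(t_cp, sum(local_cost) / n_eus)


def absolute_efficiency(t_lowest: float, t_end: float) -> float:
    """Ratio of the lower bound to the measured execution time."""
    return t_lowest / t_end


costs = [3, 10, 3, 3]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
t_cp = critical_path_length(costs, edges)  # path 0 -> 1 -> 3: 3 + 10 + 3 = 16
print(t_cp, lower_bound(t_cp, costs, 2), lower_bound(t_cp, costs, 1))
```

With two EUs the bound is set by the critical path (16 cycles), whereas with a single EU it is set by the total work (19 cycles).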

To measure the efficiency of the algorithm, we varied the percentage of remote 
nodes in the randomly generated DDG from 25% to 75%, and the size of the 
graph from 10 nodes to 1000 nodes. Each graph has three times as many edges 
as the number of nodes. Then we generated twenty distinct random DDGs and 
applied the partition algorithm to each one of them. We computed the average 
execution time for each run, compared it with $T_{lowest}$, and present the average
efficiency in Table 1. The algorithm did remarkably well for machines with four 
or eight execution units per processing node (SMP and SCMP). The algorithm 
also did quite well both for the EARTH-SU that has a single EU and a very low 
thread switching cost and for the EARTH-MANNA-SPN that has two EUs per 
processing node. The results for graphs with a large number of nodes for the 
EARTH-MANNA-DUAL are not as good. This should not come as a surprise 
because this architecture has a single EU and very large thread switching costs. 



Table 1. Average partition algorithm efficiency.



6 Related Work

The thread partition problem for multi-threaded architectures is similar to the
task partitioning and scheduling problem [6, 9]. In both problems a program has
to be divided into smaller pieces with respect to some constraints (dependencies).
The focus of task partitioning is to allocate N tasks onto M processors in order to
reduce the total execution time. This problem is often represented as a graph
partition problem in which nodes denote tasks and edges represent two types
of constraints: precedence constraints, i.e., one task must complete before another
task can start; and communication constraints, i.e., data must be exchanged
between two tasks. When the communication constraint is taken into consideration,
the task partitioning problem is NP-complete [2]. In this case, the
optimization goal is often reduced to minimizing the total communication [6]. For
further discussion of related work we refer to [7] and [1].



7 Conclusion 

We designed, implemented, and evaluated an efficient, effective, and robust algo- 
rithm to partition a program into threads for the case in which multiple execution 
units are available in each processing node of a parallel architecture. The algo- 
rithm is efficient because it generates a partition that results in an execution 
time that is very close to the best possible execution time determined by the 
length of the critical path and the total amount of computation existing in the 
program. The algorithm is robust because it worked efficiently for a varied set of
architectures and a wide range of latencies between processing nodes. The algorithm
is effective because it employs a data structure associated with a searching
algorithm that reduces the time complexity of the algorithm. In our experimental
framework we tested the algorithm with several thousand data dependency
graphs with up to a thousand nodes and several thousand connections.
Abstract. Networks of workstations (NOWs) are arranged as switch-based networks,
which allow the layout of both regular and irregular topologies. However, the
irregular interconnect pattern makes routing and deadlock avoidance quite
complicated. Current proposals use the up*/down* routing algorithm to remove cyclic
dependencies between channels and avoid deadlock. Recently, we proposed a simple
and effective methodology to compute up*/down* routing tables. The resulting
routing algorithm is very effective in irregular topologies. However, its
behavior is very poor in regular networks with orthogonal dimensions. Therefore,
we propose a more flexible routing scheme that is effective in both regular
and irregular topologies. Unlike up*/down* routing algorithms, the proposed
routing algorithm breaks cycles at different nodes for each direction in the cycle,
thus providing better traffic balancing than that provided by up*/down* routing
algorithms. Evaluation results modeling a Myrinet network show that the new
routing algorithm increases throughput with respect to the original up*/down*
routing algorithm by a factor of up to 3.5 for regular networks, while maintaining
the performance of the improved up*/down* routing scheme proposed in [7]
when applied to irregular networks.

Keywords: Networks of workstations, regular and irregular topologies, routing 
algorithms, deadlock avoidance. 



1 Introduction 

NOWs are arranged as a switch-based network, which provides the wiring flexibility,
scalability, and incremental expansion capability required in this environment. In order
to achieve high bandwidth and low latencies, NOWs are often connected using gigabit
local area network technologies. There are recent proposals for NOW interconnects like
Autonet [9], Myrinet [1], Servernet II [4], and Gigabit Ethernet [10].

Switch-based networks allow the layout of both regular and irregular topologies.
However, regular networks are often used when performance is the primary concern
[6]. On the other hand, the irregular interconnect pattern makes routing and deadlock
avoidance quite complicated. Current proposals use the up*/down* routing algorithm.
Other routing schemes have been proposed for NOWs, like adaptive-trail routing [5],
minimal adaptive routing [11], and smart-routing [2]. However, these routing algorithms
have not been considered due to their limited applicability or their high computational
cost.

In up*/down* routing [9], a breadth-first search (BFS) spanning tree is computed.
This algorithm is quite simple, and has the property that all the switches in the network
will eventually agree on a unique spanning tree. A direction (“up” or “down”) is
assigned to each network link, based on its position in the spanning tree, and messages
are routed through sequences of “up” or “down” channels. The “down” to “up” transition
is forbidden. As a consequence, up*/down* routing often fails to supply minimal
paths between non-adjacent switches, and this happens more frequently as network
size increases. Recently, it has been shown that a different assignment of direction
to links may lead to a significant increase in the number of minimal paths followed
by messages. The methodology to compute routing tables is based on obtaining a
depth-first search (DFS) spanning tree instead of a BFS spanning tree [7]. This methodology
is very efficient in irregular networks. However, in regular networks such as meshes or
hypercubes, its behavior is noticeably worse than that of the dimension-order routing
algorithm (DOR).

In this paper, we propose a more flexible routing scheme that is effective in both 
regular and irregular network topologies. It is based on computing a DFS spanning tree 
to break cyclic dependencies, like the algorithm proposed in [7]. However, unlike in 
up*/down* routing schemes, the removal of cyclic channel dependencies can be done
separately for each direction in each cycle, thus allowing us to achieve better traffic 
balancing in most network topologies. 

The rest of the paper is organized as follows. In Section 2, the up*/down* routing
scheme and the methodologies to compute its routing tables are described. In Section 3, 
a flexible routing scheme that is effective in both regular and irregular networks is pro- 
posed. Section 4 shows performance evaluation results for the new routing algorithm. 
Finally, in Section 5 some conclusions are drawn. 



2 Up* /Down* Routing 

Up*/down* routing is the most popular routing scheme currently used in commercial
networks, such as Myrinet [1], and is valid for networks with regular or irregular
topology. Different methodologies can be applied to compute up*/down* routing tables.
These methodologies are based on an assignment of direction (“up” or “down”) to the
operational links in the network by building a spanning tree, and they differ
in the type of spanning tree built. One methodology is based on a BFS spanning
tree, as proposed in Autonet [9], whereas another is based on
a DFS spanning tree, as recently proposed in [7].

In networks without virtual channels, the only practical way of avoiding deadlock
consists of restricting routing in such a way that cyclic channel dependencies¹ are
avoided [3]. To avoid deadlocks while still allowing all links to be used, up*/down*
routing uses the following rule: a legal route must traverse zero or more links in the “up”
direction followed by zero or more links in the “down” direction. Thus, cyclic channel
dependencies are avoided by imposing routing restrictions, because a message cannot
traverse a link along the “up” direction after having traversed one in the “down”
direction. Next, we describe how to compute both a BFS and a DFS spanning tree, and how to
assign a direction to the links.

¹ There is a channel dependency from a channel $c_i$ to a channel $c_j$ if a message can hold $c_i$ and
request $c_j$. In other words, the routing algorithm allows the use of $c_j$ after reserving $c_i$. It will
be represented as $c_i \to c_j$.
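The up*/down* rule stated above can be checked mechanically. The following Python sketch assumes that a route is given as the sequence of directions ("up" or "down") of the links it traverses; the function name `is_legal_updown` is ours.

```python
from typing import Iterable


def is_legal_updown(directions: Iterable[str]) -> bool:
    """True if the route uses zero or more 'up' links followed by
    zero or more 'down' links, i.e. it never goes 'up' after 'down'."""
    seen_down = False
    for d in directions:
        if d == "down":
            seen_down = True
        elif d == "up":
            if seen_down:
                return False  # forbidden "down" to "up" transition
        else:
            raise ValueError(f"unknown direction: {d!r}")
    return True


print(is_legal_updown(["up", "up", "down"]))  # True
print(is_legal_updown(["down", "up"]))        # False
print(is_legal_updown([]))                    # True
```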

2.1 Computing a BFS Spanning Tree 

First, to compute a BFS spanning tree, a switch must be chosen as the root. Starting 
from the root, the rest of the switches in the network are arranged on a single spanning 
tree [9]. Then, an assignment of direction (“up” or “down”) to links is performed. The 
“up” end of each link is defined as: 1) the end whose switch is closer to the root in 
the spanning tree; 2) the end whose switch has the lower identifier, if both ends are at 
switches at the same tree level. The result of this assignment is that each cycle in the 
network has at least one link in the “up” direction and one link in the “down” direction. 
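
The direction assignment described above can be sketched in Python as follows. The sketch assumes a connected network given as an adjacency dictionary with integer switch identifiers; the names `bfs_levels` and `up_end` are ours and are not taken from Autonet.

```python
from collections import deque
from typing import Dict, List


def bfs_levels(adj: Dict[int, List[int]], root: int) -> Dict[int, int]:
    """Distance (tree level) of every switch from the root of the BFS tree."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def up_end(u: int, v: int, level: Dict[int, int]) -> int:
    """Switch at the "up" end of link (u, v): the one closer to the root,
    or the one with the lower identifier if both are at the same level."""
    if level[u] != level[v]:
        return u if level[u] < level[v] else v
    return min(u, v)


# Four switches: a ring 0-1-3-2-0 plus the extra link 1-2.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
level = bfs_levels(adj, root=0)
print(level)                # {0: 0, 1: 1, 2: 1, 3: 2}
print(up_end(3, 1, level))  # 1: switch 1 is closer to the root
print(up_end(1, 2, level))  # 1: same level, lower identifier
```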

2.2 Computing a DFS Spanning Tree 

As with the BFS spanning tree, an initial switch must be chosen as the root before
starting the computation of a DFS spanning tree. The selection of the root is made by
using heuristic rules. The rest of the switches are added following a recursive procedure.

Unlike in the BFS spanning tree, switches are added to the path according to
heuristic rules. We apply the heuristic rule recently proposed in [8]. Starting from the
root switch, the switch with the highest number of links connecting to switches that already
belong to the tree is selected as the next switch in the tree. In case of a tie, the switch with
the highest average topological distance to the rest of the switches is selected first.

According to [8], the root switch is selected after computing all the DFS spanning
trees and selecting one of them based on two behavioral routing metrics: (1) the average
number of links in the shortest routing paths between hosts over all pairs of hosts,
referred to as average distance; and (2) the maximum number of routing paths crossing
through any network channel, referred to as crossing paths. We first compute the metrics
for each DFS spanning tree obtained by selecting the root among all the switches in the
network. Finally, the switch selected as the root will be the one whose DFS spanning
tree provides the lowest value for the crossing paths metric. In case of a tie, the switch
with the lowest value for the average distance metric will be selected. Therefore, the
selected switch will be the one that allows more messages to follow minimal paths and
provides better traffic balancing.
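
As an illustration, the two metrics can be computed from a set of routing paths as in the following Python sketch. For simplicity it assumes that one routing path per (source, destination) pair of switches is given as a list of switches, whereas the metrics above are defined over pairs of hosts; the function name `routing_metrics` is ours.

```python
from collections import Counter
from typing import Dict, List, Tuple


def routing_metrics(paths: Dict[Tuple[str, str], List[str]]) -> Tuple[float, int]:
    """Return (average distance, crossing paths) for the given routing paths."""
    if not paths:
        raise ValueError("at least one routing path is required")
    channel_use = Counter()
    total_links = 0
    for route in paths.values():
        total_links += len(route) - 1
        # Each consecutive pair of switches is one unidirectional channel.
        for a, b in zip(route, route[1:]):
            channel_use[(a, b)] += 1
    average_distance = total_links / len(paths)
    crossing_paths = max(channel_use.values(), default=0)
    return average_distance, crossing_paths


paths = {
    ("a", "b"): ["a", "b"],
    ("a", "c"): ["a", "b", "c"],
    ("b", "c"): ["b", "c"],
}
avg, crossing = routing_metrics(paths)
print(avg, crossing)  # 4 links over 3 paths; channels (a,b) and (b,c) carry 2 paths
```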

Next, before assigning direction to links, switches in the network must be labeled 
with positive integer numbers. A different label is assigned to each switch. When as- 
signing directions to links, the “up” end of each link is defined as the end whose switch 
has the higher label. 



3 Flexible Routing Scheme 

Our interest in a flexible routing scheme that is effective in both regular and irregular
networks is due to the following reasons: (1) in some cases, for performance reasons,
it may be advisable to use regular networks; (2) even when regular networks are used,
links may fail or components may be added or removed in a NOW environment, which
recommends the use of generic routing algorithms suitable for the resulting topology;
and (3) often, the switch arrangement exhibits a certain degree of regularity or hierarchy, but
this may not be enough to allow specific routing algorithms for regular networks to be
used.

Fig. 1. (a) Link direction assignment and cyclic channel dependency removed by using the
up*/down* routing scheme, and (b) independent removal of cyclic channel dependencies for
the two directions in the cycle.

Fig. 2. Balancing paths by removing additional channel dependencies.

As shown in Section 2, up*/down* routing algorithms remove cyclic channel
dependencies by breaking cycles at the same node for both directions of each cycle. That
is, if the dependency between the channels $c_i$ and $c_j$ is removed, then the channel
dependency between the channels $c_j$ and $c_i$ will be removed too. We have observed that
this constraint may cause an uneven distribution of traffic in some network topologies.

To illustrate this idea, consider the simple 4-switch network depicted in Figure 1(a).
Solid arrows represent the “up” direction assigned to each link by the up*/down*
routing algorithm. Removed channel dependencies are shown as dashed arrows. Each
routing path crossing a channel is represented by [x,y], where x and y represent the
source and destination switches of the routing path, respectively. Every routing path is
computed by selecting a single path between every pair of switches, thus minimizing
the number of routing paths crossing each channel. We can observe that the up*/down*
routing algorithm unevenly distributes the routing paths among the channels, since there
are some channels crossed by 3 routing paths, whereas other channels are crossed by a
single routing path.

As can be observed, up*/down* routing imposes hard constraints to remove channel
dependencies. However, by removing channel dependencies independently for each
direction in each cycle, a more even distribution of traffic can be achieved. In Figure
1(b) we can observe that the number of dependencies removed from the network is the
same as the number of dependencies removed in Figure 1(a). However, the routing
restrictions are independent for the two directions of the cycle. Also, we can observe that
the new removal of cyclic channel dependencies decreases the maximum number of
routing paths crossing every channel in the network down to 2. Moreover, an even
distribution of the routing paths is achieved. Furthermore, the same distribution of routing
paths shown in Figure 1(b) can be achieved even if additional routing restrictions are
imposed on the network, as can be seen in Figure 2.



3.1 Avoiding Deadlocks 

To ease the identification of cyclic channel dependencies in the network graph, once a
DFS spanning tree has been computed, a different label is assigned to each switch, as
described in Section 2.2. The label assigned to each switch allows us to detect cycles
in the network graph, because every cycle will have a single switch $a$ that satisfies the
condition $L(b) > L(a) < L(c)$, where $L(x)$ is a function that returns the label assigned
to switch $x$, and $b$, $a$, and $c$ are adjacent network switches.

Next, all the switches are visited in decreasing label order to break cycles in the
network graph. The switch with the highest label is visited first. At every switch, the
following rules are applied to remove channel dependencies, assuming that the switch
currently visited is $v_i$ and $L(v_i) = i$:

(R1) if the switches $v_i$, $v_k$, $v_r$, and $v_j$ form a cycle, such that $i > k$ and $i > j$,
the channel dependencies² $c_{k,i} \to c_{i,j}$, $c_{i,k} \to c_{k,r}$, $c_{r,k} \to c_{k,i}$, and
$c_{j,i} \to c_{i,k}$ are removed, as can be seen in Figure 3(a).

(R2) if there exist switches $v_j$ and $v_k$, such that they are adjacent to switch $v_i$, where
$k > i < j$ and $k < j$, the dependency $c_{k,i} \to c_{i,j}$ is removed, as can be seen in Figure
3(b).

(R3) if there exist switches $v_j$ and $v_k$, such that they are adjacent to switch $v_i$,
where $k > i < j$, $k < j$, and all of them form part of the same cycle, the dependency
$c_{j,i} \to c_{i,k}$ is removed, as can be seen in Figure 3(c).
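
To make the rules more concrete, the following Python sketch enumerates the dependencies removed by rule (R2) alone. It reflects our reading of the rule and makes several assumptions: the network is an adjacency dictionary, labels are distinct integers, a unidirectional channel from $v_x$ to $v_y$ is written as the pair (x, y), and the function name `rule_r2_removals` is hypothetical. Rules (R1) and (R3), which also require cycle membership, are not covered.

```python
from typing import Dict, List, Tuple

Channel = Tuple[str, str]  # (from switch, to switch)


def rule_r2_removals(adj: Dict[str, List[str]],
                     label: Dict[str, int]) -> List[Tuple[Channel, Channel]]:
    """Dependencies c_{k,i} -> c_{i,j} removed at every switch v_i that has
    two neighbours v_k and v_j with label(v_i) < label(v_k) < label(v_j)."""
    removed = []
    for vi, neighbours in adj.items():
        higher = sorted((n for n in neighbours if label[n] > label[vi]),
                        key=label.get)
        for a in range(len(higher)):
            for b in range(a + 1, len(higher)):
                vk, vj = higher[a], higher[b]  # label[vk] < label[vj]
                removed.append(((vk, vi), (vi, vj)))
    return removed


# Four-switch ring a-b-c-d-a labelled in the order a, b, c, d.
adj = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
label = {"a": 1, "b": 2, "c": 3, "d": 4}
print(rule_r2_removals(adj, label))  # [(('b', 'a'), ('a', 'd'))]
```

In this ring only switch a has two neighbours with higher labels, so a single dependency is removed there and messages arriving at a from b may not continue towards d.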




Fig. 3. Channel dependencies removed by applying (a) rule (R1), (b) rule (R2), and (c) rule (R3).



² The unidirectional channel that links the switch $v_i$ to the switch $v_j$ is denoted as $c_{i,j}$. Similarly,
the unidirectional channel that links the switch $v_j$ to the switch $v_i$ is denoted as $c_{j,i}$.

The new routing scheme is deadlock-free, since the rules (R2) and (R3) guarantee
that all the cycles in the network graph are broken. Note that some cycles in the network
graph could be broken by the rule (R1). However, the aim of this rule is to remove
additional channel dependencies in order to achieve better traffic balancing.

Nevertheless, notice that the network may become disconnected after applying rule
(R1). This is because, unlike rules (R2) and (R3), rule (R1) may remove channel
dependencies between channels belonging to the DFS spanning tree. Therefore, after applying
rules (R1), (R2), and (R3) to every switch in the network, we must apply the following
rule to guarantee that the network remains connected:

(R4) if there is no routing path to reach switch $v_j$ from switch $v_i$, and: (a) $i > j$,
restore the dependencies $c_{k+1,k} \to c_{k,k-1}$, $\forall k : i - 1 > k > j + 1$; (b) $i < j$, restore
the dependencies $c_{k+1,k} \to c_{k,k-1}$, $\forall k : j + 1 < k < m - 1$, where $m$ is the label
associated to switch $v_m$, which in turn is adjacent to switch $v_n$, $i < n < j < m$. Note
that the restored dependencies are always established between channels belonging to
the DFS spanning tree.

Once all the switches in the network have been visited, the routing tables will be 
filled with all the shortest paths between every pair of switches. 



4 Performance Evaluation 

In this section, we evaluate by simulation the performance of the new routing algorithm
proposed in Section 3 (FX-DFS). For comparison purposes, we have also evaluated
the up*/down* routing algorithms based on both BFS and DFS spanning trees, as
described in Section 2 (UD-BFS and UD-DFS, respectively). Also, dimension-order
routing (XY) has been evaluated for regular networks, such as meshes. In order to obtain
realistic simulation results, we have used timing parameters for the switches taken from
a commercial network. We have selected Myrinet because it is becoming increasingly
popular due to its very good performance/cost ratio.



4.1 Network and Switch Model 

Irregular network topologies have been generated randomly. Network sizes of 8, 16, 32,
and 64 switches have been evaluated. We have generated ten different topologies for
each network size analyzed. The maximum variation in throughput improvement of
FX-DFS routing with respect to UD-DFS routing is not larger than 10%. Results
plotted in this paper correspond to the topologies that achieve the average behavior
for each network size. Also, we have generated regular networks, like 2-D meshes and
2-D tori, with network sizes of 4 × 4 and 8 × 8 switches. For space reasons, only some
significant results are plotted.

For both regular and irregular networks, we assume that every switch in the network 
has 8 ports, using 4 ports to connect to workstations and leaving 4 ports to connect to 
other switches. For message length, 32-flit and 512-flit messages were considered. A 
uniform message destination distribution has been used. 

The path followed by each message is obtained using table-lookup at the source 
host, very much like in Myrinet networks. Therefore, deterministic source routing is 
assumed. Wormhole switching is used. Flits are one byte wide and the physical channel 
is one flit wide. 
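
The random generation of irregular topologies is not detailed here. The following Python sketch shows one possible way to produce a connected random topology in which each switch uses at most 4 ports for inter-switch links. The construction (a random spanning tree followed by random extra links, with at most one link between any two switches) and the names `random_irregular_topology` and `is_connected` are assumptions made for illustration, not the generator used for the results in this paper.

```python
import random


def random_irregular_topology(num_switches, max_links=4, seed=0):
    """Return {switch: set(neighbours)} for a connected random topology."""
    if max_links < 2:
        raise ValueError("max_links must be at least 2")
    rng = random.Random(seed)
    adj = {s: set() for s in range(num_switches)}

    # Step 1: random spanning tree, so the network is connected by construction.
    order = list(range(num_switches))
    rng.shuffle(order)
    for idx in range(1, num_switches):
        new = order[idx]
        candidates = [s for s in order[:idx] if len(adj[s]) < max_links]
        parent = rng.choice(candidates)
        adj[new].add(parent)
        adj[parent].add(new)

    # Step 2: add random extra links between switches that still have free ports.
    pairs = [(a, b) for a in range(num_switches) for b in range(a + 1, num_switches)]
    rng.shuffle(pairs)
    for a, b in pairs:
        if b not in adj[a] and len(adj[a]) < max_links and len(adj[b]) < max_links:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def is_connected(adj):
    """Depth-first search from switch 0 to check that every switch is reachable."""
    seen, stack = {0}, [0]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)


topology = random_irregular_topology(16, seed=1)
print(is_connected(topology), max(len(nb) for nb in topology.values()))
```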



4.2 Simulation Results 

Figures 4(a) and 4(b) show the average message latency versus accepted traffic for an
8 x 8 torus and an 8 x 8 mesh, respectively. As can be seen, for 512-flit messages FX-DFS
achieves a noticeable improvement in both latency and throughput with respect to
UD-BFS. The throughput improvement reaches a factor of 3.5 for an 8 x 8 torus.
Moreover, FX-DFS achieves the same behavior as XY routing for 2-D meshes and
significantly increases performance with respect to UD-BFS and UD-DFS. In 2-D torus
and mesh networks, all of the evaluated routing strategies allow all messages to be
routed through minimal paths. Therefore, the difference in performance between them
can only be due to the differences in traffic balancing achieved by them. We can
observe that better traffic balancing can be obtained by using up* / down* routing based
on a DFS spanning tree. Moreover, traffic balancing in the network can be improved by
removing channel dependencies as proposed in Section 3 (FX-DFS).

Fig. 4. Average message latency vs. accepted traffic. 8 x 8 (a) torus and (b) mesh. Uniform
distribution. Message length is 512 flits.

Fig. 5. Average message latency vs. accepted traffic. Irregular topology of (a) 8 and (b) 64
switches. Uniform distribution. Message length is 512 flits.

The poor behavior of UD-BFS is due to the fact that it tends
to concentrate traffic in the channels close to the root switch of the BFS spanning tree.
Therefore, these channels become the bottleneck of the network, leading to saturation
at relatively low traffic, especially in large networks.
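
The concentration of traffic near the root can be seen on a very small example. The Python sketch below uses the usual formulation of up* / down* routing on a BFS spanning tree: the "up" end of a link is the switch closer to the root (ties broken by the lower switch identifier), and a legal path never uses an up link after a down link. The helper names are hypothetical, and the sketch is a simplified illustration rather than the simulator used for the results above.

```python
from collections import deque


def bfs_levels(adj, root):
    """Distance of every switch from the root of the BFS spanning tree."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def is_up(level, u, v):
    """True if traversing the link u -> v moves towards the root."""
    return (level[v], v) < (level[u], u)


def is_legal_up_down(level, path):
    """A legal path uses zero or more up links followed by zero or more down links."""
    gone_down = False
    for u, v in zip(path, path[1:]):
        if is_up(level, u, v):
            if gone_down:
                return False          # down -> up transition is forbidden
        else:
            gone_down = True
    return True


# Ring of four switches, rooted at switch 0.
ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
level = bfs_levels(ring, 0)
print(is_legal_up_down(level, [1, 0, 3]))   # minimal path through the root
print(is_legal_up_down(level, [1, 2, 3]))   # minimal path away from the root
```

In this ring, both paths from switch 1 to switch 3 are minimal, but only the one that crosses the root is legal, so the channels next to the root carry the traffic of both.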

On the other hand, Figures 5(a) and 5(b) show the average message latency versus
accepted traffic for irregular networks with 8 and 64 switches, respectively. Message
size is 512 flits. In general, we can observe that the behavior of FX-DFS is
almost identical to that of UD-DFS for large networks. As shown in [8], the
improvement in performance of UD-DFS with respect to UD-BFS is mainly due
to the fact that UD-DFS allows most messages to follow minimal paths, especially
in large networks. However, for very small networks, FX-DFS significantly
outperforms UD-DFS. The improvement in throughput of FX-DFS with respect to
UD-DFS is about 20%. This is only due to the fact that FX-DFS achieves better
traffic balancing. Note that UD-DFS hardly improves performance with respect to
UD-BFS, because in small networks it is likely that messages follow minimal paths.

5 Conclusions 

In this paper, we have proposed a new routing scheme for NOWs that, unlike up* / down*
routing strategies, is effective in both regular and irregular network topologies. Like the
improved up* / down* routing scheme recently proposed in [7], the new routing
algorithm is also based on computing a DFS spanning tree to break cyclic dependencies.
However, the removal of channel dependencies is performed in a more flexible way
than when applied to up* / down* routing, allowing us to achieve better traffic
balancing in most network topologies. Moreover, the new routing algorithm does not require
new resources to be added to the network. Only routing tables need to be updated.

Evaluation results modeling a Myrinet network under uniform traffic show that the 
new routing algorithm significantly outperforms up* / down* routing strategies when 
it is applied to regular and small irregular networks. In particular, the proposed routing 
scheme improves throughput with respect to the original up* / down* routing scheme by 
a factor of up to 3.5 for 2-D tori. Moreover, for 2-D meshes, the new routing algorithm 
exhibits the same behavior as the dimension-order routing algorithm (DOR), improving 
throughput with respect to the original up* / down* routing strategy.

Abstract. The execution performance of Java has been a problem since it was
introduced worldwide. As one of the solutions, a bytecode instruction folding
process for Java processors was developed in a PicoJava model and a Producer,
Operator and Consumer (POC) model. Although the instruction folding process
in these models saved extra stack operations, it could not handle certain types of
instruction sequences. In this paper, a new instruction folding scheme based on a
new, advanced POC model is proposed and shown to improve bytecode
execution. The proposed POC model is able to detect and fold all possible
instruction sequence types, including a sequence that is separated by other
bytecode instructions. SPEC JVM98 benchmark results show that the proposed POC
model-based folder can save more than 90% of folding operations. In addition, a
design of the proposed POC model-based folding process in hardware is much
smaller and more efficient than traditional folding mechanisms. In this research,
the proposed instruction folding technique can eliminate most of the stack
operations and the use of a physical operand stack, and can thereby achieve the
performance of high-end RISC processors.



1. Introduction 

It has already been more than four years since Java was introduced and widely 
adopted in the Internet arena as well as in consumer electronics and communication 
systems. Java started with good initiatives, and it was well equipped with many prom- 
ising features for its future, but it now faces a few well-known problems, especially in 
performance. 

There have been several attempts to implement the Java Virtual Machine (JVM) in a
hardware chip. PicoJava [1] and JEM1 [2], for instance, have been released as
commercial products. A smaller-scale version of JVM in a Field Programmable Gate Array
(FPGA) chip has been introduced for research purposes [3]. These hardware
approaches, with direct implementation of JVM, have clearly improved Java
performance, since there is no bytecode interpretation process involved. Although there are
many aspects of bytecode optimization as proposed in [4], the optimization of stack
operation is the most practical approach of all.

The underlying operation of Java bytecode instructions is purely stack oriented. Due
to inefficiencies of index addressing on the operand stack, Java processors execute
15% to 30% more instructions than register-based processors [5]. The problem
can be eliminated by a process called “instruction folding.” The instruction folding
process combines several stack manipulation instructions into a single register-based
instruction, resulting in fewer CPU cycles for the same amount of work.
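
To make the idea concrete, the following Python sketch folds the simplest possible pattern, two local-variable loads followed by an integer arithmetic bytecode and a local-variable store, into one register-style operation. It is a deliberately minimal illustration: the function `fold`, the lookup tables, and the textual "rN" register notation are assumptions made for this example and do not describe the folding logic of PicoJava or of the POC models discussed in this paper.

```python
# Hypothetical, highly simplified folder: it only recognises the
# four-instruction pattern "load, load, arithmetic operator, store".
LOADS = {"iload_0": 0, "iload_1": 1, "iload_2": 2, "iload_3": 3}
STORES = {"istore_0": 0, "istore_1": 1, "istore_2": 2, "istore_3": 3}
OPERATORS = {"iadd": "+", "isub": "-", "imul": "*"}


def fold(bytecodes):
    """Replace each foldable 4-instruction window by one register-style operation."""
    out = []
    i = 0
    while i < len(bytecodes):
        window = bytecodes[i:i + 4]
        if (len(window) == 4 and window[0] in LOADS and window[1] in LOADS
                and window[2] in OPERATORS and window[3] in STORES):
            # The first load is the left operand, the second load the right one.
            out.append("r%d = r%d %s r%d" % (STORES[window[3]], LOADS[window[0]],
                                             OPERATORS[window[2]], LOADS[window[1]]))
            i += 4
        else:
            out.append(bytecodes[i])   # not foldable here: keep the stack instruction
            i += 1
    return out


print(fold(["iload_1", "iload_2", "iadd", "istore_3"]))
```

Four stack instructions become a single operation on local variables 1, 2, and 3, which is the saving that instruction folding aims for.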

In this paper, a new instruction folding algorithm based on an advanced Producer, 
Operator and Consumer (POC) model is proposed. New and improved features of the 
model will be introduced and explained in detail. Also, the hardware implementation 
of the folding unit will be illustrated, and its performance, compared to traditional 
folding units, will be shown in the results section. 

SPEC JVM98 benchmark programs [6] were selected to collect bytecode data to
analyze the folding behaviors. In general, over 90% of the folding operations can be
eliminated by the new POC model-based folding technique, as shown later in the
experimental results section.

2. Related Work 

The instruction folding technique was first introduced in the PicoJava I architecture
[7] in 1996. According to the simulation results, the PicoJava I folding operation could
fold up to 60% of the stack operations [8]. Research shown in [4] applied folding
techniques similar to those of PicoJava II, and illustrated more advanced folding groups
and checking patterns. Finally, a POC model technique based on the characteristics of
bytecode instructions has been developed and demonstrated in [9].

3. Traditional Instruction Folding Techniques 

The folding techniques demonstrated by [4] and [5] are based on the instruction 
groups and the group sequence patterns. In PicoJava II, six groups of bytecode
instructions are identified: Non Foldable (NF), Local Variable (LV), Memory (MEM), One
Operand (BG1), Two Operands (BG2), and Two Operands with MEM (OP). The
decode unit issues 74-bit micro-operations to the execution engine through the operand 
fetch unit. These bytecode instructions are converted into micro-operations by the 
decode unit, which then looks for folding patterns up to four consecutive instructions 
long. Once a pattern is found, the corresponding micro-operation replaces the original 
instructions. However, folding will not work if the operands are too far from the top of 
the stack, which results in low foldability of the folding unit [5]. 

A folding technique introduced by [9] categorized all the bytecode instructions into 
POC types and subtypes. The sequence of an instruction stream is scanned through a 
highly sophisticated folding-rule checker to determine its foldability. The folding-rule 
checker unit generates four output status signals: Serial Instruction (SI), Foldable 
Instruction (FI), Continuing state (C), and Ending state (E). The sequences of
instruction types that the checker receives are combinations of producer-type instructions (P),
four different types of operators (OE, OB, OC, and OT), and consumer-type
instructions. This folding technique uses a 4-foldable strategy to simplify the decoder. The
hardware structure of the folding unit is scalable (two to n foldable), and each unit
consists of 26 logic gates and 3 logic levels. Therefore, to build 4-foldable logic circuitry,
a designer needs at least 96 logic gates, which would take 12 logic levels in the
worst-case timing path. Although the POC model can fold more instructions than PicoJava,
the folding unit gets more complicated and cannot handle many types of instruction
sequences.

Currently, all of the folding algorithms described above detect and fold a stream of 
instructions passively. This frequently results in losing many foldable instructions, 
which is due to the un-optimized instruction sequence rather than a deficiency in the 
folding logic. A new proposed POC model-based folding mechanism can manipulate 
the instruction sequences properly to maximize the folding logic algorithm with less 
hardware and simpler design. 

4. New Advanced Folding Model 

4.1. Characteristics of Instruction Sequences 

In order to develop an instruction folding mechanism, it would be essential to under- 
stand the nature of instruction sequences and their foldable types. It would be an ideal 
situation if a foldable instruction sequence is followed by another foldable sequence. 
However, many times foldable instructions are separated by other bytecode instruc- 
tions. Five distinct types of relationships between a foldable instruction group and its 
adjacent instructions have been found, as shown in Figure 1. The first type, namely 
Normal Sequence, is a sequence of two normal, consecutive-foldable groups. Most 
existing instruction folding units can fold this sequence type without any difficulties. 

Types I, II, III, and IV show special types of sequences that ordinary folding
algorithms cannot detect or fold completely. These extra types show typical cases, which
indicate that perfectly foldable instructions can be separated by a complete
folding-instruction group or by non-foldable instructions. The foldability would be largely
increased by finding all foldable patterns, even if the bytecodes in the foldable patterns
are separated by other bytecodes.

In Type I, a load instruction, “iload_2,” in group B, is separated by group A. The 
“iload_2” must be saved in an instruction queue and retrieved later to complete the 
folding process of group B. After execution of group A, the broken instructions in 
group B are then combined with the queued instruction and executed. 

Type II shows a slightly more complicated instruction sequence. Notice that the
value in location 3 is pushed onto the stack by “iload_3” in group B, and the same
location 3 is written back by “istore_3” in group A during a normal stack operation. There
exists a single dependency between groups A and B. In this case, group B should be
executed first, followed by the execution of group A, to maintain a correct value in
location 3.

Type III is an example of a double dependency between groups A and B. Location 3
is read by group C and written back by group A, while location 8 is read by group A
and written back by group D. In this case, folding C with group D cannot give correct
values in both locations 3 and 8. Therefore, instruction C, “iload_3,” and group D
should be executed separately for correct operation.
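
The distinction between the single dependency of Type II and the double dependency of Type III can be expressed as two set intersections on local-variable locations. The Python sketch below is only a reading of the two cases described above: the function `classify`, its parameters, and the returned strings are hypothetical, and the dependency checker of the proposed folding unit is a hardware block that is not reproduced here.

```python
def classify(queued_reads, a_reads, a_writes, later_writes):
    """Decide how a queued load and its later group relate to an intervening group A.

    queued_reads : locations read by the queued (earlier) load instruction(s)
    a_reads      : locations read by the intervening group A
    a_writes     : locations written by the intervening group A
    later_writes : locations written by the group that consumes the queued load
    """
    # The queued load must observe the value from before group A overwrites it.
    a_overwrites_queued = bool(set(queued_reads) & set(a_writes))
    # Group A must read its operands before the later group overwrites them.
    later_overwrites_a = bool(set(a_reads) & set(later_writes))

    if a_overwrites_queued and later_overwrites_a:
        return "double dependency: execute separately"      # as in Type III
    if a_overwrites_queued:
        return "single dependency: execute later group first"  # as in Type II
    return "no dependency: fold after group A"               # as in Types I and IV


# Type II: "iload_3" is queued and group A contains "istore_3".
print(classify({3}, set(), {3}, set()))
# Type III: group A reads location 8 and writes 3; the later group writes 8.
print(classify({3}, {8}, {3}, {8}))
```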

The last case, Type IV, is a variation of Type I. An outcome of group A is folded with
a previously queued instruction, “iload_3,” to form a new group B. Then, group B
subsequently forms another foldable group C, and so on. The results of these folding types
are pushed onto the stack in normal stack operations. However, in this research, these
arithmetic results are kept in temporary registers or accumulators to avoid extra stack
operations.





[Figure 1: instruction sequence diagrams for the Normal Sequence and Types I to IV.
Recoverable panel labels: Type III (C -> A -> D), Type IV (A -> B -> C -> D).]

Fig. 1. Instruction Folding Sequence Types



4.2. New Advanced POC Models 

A new POC model uses three major stack instruction types in the folding process:
Producers, Operators, and Consumers, the same as the previous POC model. These types
are assigned to each instruction based on the bytecode instruction characteristics.
Unlike the old POC model, the proposed POC model introduces new operator types,
which are Producible Operator (Op) and Consumable Operator (Oc). Results of
bytecode operations always become either producible or consumable types. These two new
Operator types replace the old Operators, OE, OB, OC, and OT.

This new POC model-based folding technique aggressively scans an instruction
stream and attempts to locate the sequence type, as previously shown in Figure 1. If the
sequence type is found, the folding unit combines foldable instructions and executes
the sequence on the fly. Otherwise, the unit marks the instructions as a broken
sequence in an instruction queue until they are popped and combined with the
subsequent pieces of the sequence. Figure 2 illustrates a complete flow of this folding
process. The bold-lined boxes and transitions indicate the portion of the proposed POC
algorithm that is not found in the traditional folding algorithms. Operations of Type
I and IV sequences are shown in this flow chart. A bold-oval state in Figure 2
represents a dependency checker used for Type II and III sequences.



[Figure 2: flow chart of the folding algorithm. Legend: I(b): bytecode instruction;
I(t): type of I(b), i.e., P, O, C; I(s): subtype of I(t), i.e., p, c.]

Fig. 2. New POC Folding Model Algorithm

As will be shown in the results section, Type II and Type III rarely exist in normal
bytecode sequences according to the benchmark testing statistics, which might be due
to the enhancement of the Java compiler’s code optimization. However, there are a
number of Java bytecode assemblers, e.g. Jasmin [11], and recent custom-generated
class files that produce un-optimized bytecode structures. The dependency checker
algorithm has been designed and presented for these cases, and can optionally be
implemented in the folding unit.

5. Experimental Results and Analysis

Data were collected from a simple Java program and SPEC JVM98 benchmark
programs [6] using trace-driven analysis of run-time bytecode instructions. A simple
Java program, “addsum2,” executes a loop to add integers from zero to ten. The SPEC
benchmark programs selected for this experiment are “db”, “compress”, and “javac.”
The “db” program performs multiple database functions on a memory-resident
database with 1.2 million bytecodes executed. The “compress” program compresses and
decompresses zip files at a high compression ratio with 3.6 million bytecodes. Finally,
“javac” is the Java compiler from the JDK 1.0.2, which executes about 6.9 million
bytecodes.

The proposed POC model-based algorithm composes instruction patterns based on
combinations of the instruction types, i.e. Producer (P), Producible Operator (Op),
Consumable Operator (Oc), and Consumer (C).

The new POC model-based instruction folder is designed to detect the sequence
types as shown in Figure 1 in Section 4. The sequence types and the distributions in the
benchmark programs are shown in Table I. The first numbers in the table are numbers
of the folding types detected in the particular benchmark applications, and the numbers
in parentheses are their percentages. Notice that Types II and III rarely appear in the
programs. Implementation of these types is not necessary for normal JDK-compiled
Java programs. Therefore, hardware implementation of the dependency checker can be
optionally omitted from the instruction folding design. The table also indicates that
87% to 93% of foldable instructions are found across the applications. This research
mainly focuses on instruction sequences that are foldable.



TABLE I Distribution of Sequence Types in Benchmarks

            ADDSUM2           DB                  COMPRESS            JAVAC
Normal      70,404 (44.6%)    443,216 (36.5%)     442,810 (12.5%)     2,226,298 (32.3%)
Type I      18,390 (11.7%)    207,836 (17.1%)     1,099,156 (30.9%)   963,736 (14.0%)
Type II     16 (0%)           61 (0%)             176 (0%)            212 (0%)
Type III    0 (0%)            11 (0%)             39 (0%)             76 (0%)
Type IV     47,499 (30.1%)    455,310 (37.5%)     1,749,735 (49.2%)   2,922,121 (42.4%)
Non-Fold    21,436 (13.6%)    108,998 (9.0%)      264,113 (7.4%)      785,272 (11.4%)
Total       157,745           1,215,432           3,556,029           6,897,715
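
The share of foldable instructions quoted in the text can be cross-checked from the Non-Fold and Total rows of Table I. The short Python computation below does this; the dictionary layout and the function name `foldable_share` are chosen for this illustration only.

```python
# Non-Fold and Total counts copied from Table I.
TABLE_I = {
    "addsum2":  {"non_fold": 21_436,  "total": 157_745},
    "db":       {"non_fold": 108_998, "total": 1_215_432},
    "compress": {"non_fold": 264_113, "total": 3_556_029},
    "javac":    {"non_fold": 785_272, "total": 6_897_715},
}


def foldable_share(non_fold, total):
    """Percentage of executed bytecodes that belong to a foldable sequence type."""
    return 100.0 * (total - non_fold) / total


for name, row in TABLE_I.items():
    print("%-8s %.1f%%" % (name, foldable_share(**row)))
```

The computed shares lie between roughly 86% and 93%, in line with the range stated above.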



Unlike the traditional models, by detecting and folding a broken sequence, the new
POC model-based folder is able to find more foldable instructions than the traditional
folding mechanisms. Thus, the foldability improved dramatically in the proposed POC
model-based folding mechanism. Figure 3 is a graphical representation of the foldability
of bytecode instructions for the new POC model, the old POC model, and
PicoJava-based folding mechanisms on the benchmark programs. As shown in the data, the new
POC model-based folder consistently saves more than 90% of the folding operations
throughout the applications. If the “invoke*” instructions are handled by a microcode
operation, the foldability will be close to 100%. However, the old POC model and
PicoJava folding achieved, at best, only 70% and 50% savings, respectively. Unlike
the new POC model-based technique, the foldabilities of the old POC and PicoJava vary
depending on the applications, e.g., very low foldabilities in “compress.”

[Figure 3 legend: New POC, Old POC, PicoJava]

Fig. 3. Folding Operations Saved By Folding Tech-



6. Implementation of Instruction Folding

Designing a complete instruction folding unit is beyond the scope of this research. 
The research, however, focuses on the implementation of instruction decoding and pat- 
tern recognition units. Since the new POC model uses only four Operator types, a 
design of the type decoder is easier and simpler than the previous folding mechanisms. 

A folding pattern recognizer is designed based on the six different folding groups 
described in the previous section. The instruction decoding logic and the foldability 
checker are mapped to generic gates. The unit can decode up to six contiguous byte- 
code instructions to determine the foldability. This unit is also able to detect all possi- 
ble instruction folding patterns for most of the applications. 

According to the hardware synthesis data, total logic resources used for the six fold- 
able instruction pattern unit are 25 primitive gates with six logic levels. This shows 
considerable improvement over the old POC model’s folding unit, which requires 96 
logic gates in 12 logic levels for only four foldable instruction patterns. 

Alternatively, the unit can be simplified much further if a designer is willing to
design a four-foldable unit by sacrificing instruction foldability. Such a unit only
requires 14 logic gates with four logic levels.

7. Future Work 

This research presents a theoretical approach to folding mechanisms and partial
implementations of the major units. However, complete hardware implementation of
entire folding units, including instruction queues and register units, is needed to
measure and understand the overall performance enhancements to Java processors. An
effort to port the new POC model-based folding structure to an existing Java FPGA
processor is currently in progress.

In addition, the proposed new POC model-based folding technique might be a good 
reusable software asset to be applied to Java compilers or Java class loaders. A 
research project that implements this POC model in a Java static class loader is now 
near completion. 

8. Conclusion

In order to overcome an inherent performance problem with Java, an enhanced 
instruction decoding and folding process needs to be developed for Java hardware pro- 
cessors. In this paper, a new approach has been proposed by introducing an advanced 
POC model-based instruction folding technique. 

The characteristics of the bytecode sequences and the patterns have been clearly 
verified and analyzed. Various types of bytecode sequence patterns have been devel- 
oped and studied. Combining broken bytecode sequences dramatically improves the 
foldability. Through a series of experiments using SPEC JVM98 benchmark programs, 
a proposed POC model-based folder was shown to yield a very high foldability that 
consistently achieves 90% across the applications, while the previous POC model and 
PicoJava folding achieved only as much as 70% and 50% savings, respectively. Also, 
the proposed instruction folder is simpler and more efficient to implement in hardware. 
Currently, an effort to apply this new technique to an existing Java processor is under
way.

Abstract. This article describes the LuChat chat system, which is a
specific instance of a web based collaborative white-board style of
application. To increase portability of our system among the most popular
browsers, we implemented our own (subset of) the Java RMI framework,
which requires no installation effort on the client side and still runs in
both Netscape and Internet Explorer. The performance of our RMI
system is comparable to the Java RMI version that comes with Netscape’s
Java virtual machine. The response time of our complete system under
light load is under 30 ms, with the two most popular browsers having a
response time of under 15 ms. Under normal use, our system will scale
to a high number of clients.



1 Introduction 

The Java programming language [5] has evolved from a language mainly used to
liven up internet pages to an all-purpose programming language. We feel,
however, that Java’s main attraction lies in its inherently secure, distributed nature,
making it the language of choice for a wide variety of web based applications.
The language offers powerful communication primitives like Remote Method
Invocation (RMI) [11,9], while the virtual machine itself allows dynamic distribution
of object code in a secure way.

In this paper, we introduce the LuChat chat system, which is an example 
of a collaborative application. The LuChat system consists of a chat server, 
implemented as a Java application, and a chat client, which is implemented as 
a Java applet, which runs in the most popular browsers without requiring any 
installation effort on the client side. We believe that the latter is a requirement 
essential for broad acceptance of any web based application. 

We will discuss the design and implementation of the LuChat system in 
Section 2. This section will also introduce LuchatRMI, which is a replacement 
for Java RMI, developed to increase the portability of our system to different 
web browsers. We will devote the largest part of this article to the assessment 
of the performance of our system.

2 Design and Implementation

The LuChat system is a web based client-server application. The Chat Server and 
all the Chat Rooms run in one multithreaded process. The client components run 
as an applet inside a web browser on the client host. The chat server maintains 
the list of chat rooms, while each chat room manages activities taking place in 
these rooms. 

To implement communication in the LuChat system, we chose to use Java 
RMI, since it provides a simple communication model. RMI allows communi- 
cation between Java objects located in different Java Virtual Machines in a 
seamless way. Using Java RMI in a web based application does create a number 
of problems, however. 

The first problem encountered in communication between server and client 
applets is the tight security restrictions imposed on applets running in web 
browsers. Applets are only allowed to initiate communication with the host that 
they are downloaded from. To deal with this problem without requiring any ef- 
fort on the client side, all communication must be initiated by the clients, to 
the server. This makes it impossible to implement an active broadcast from the 
server to the clients in an efficient way, for instance by using a spanning tree 
algorithm. 

We chose to have clients continuously poll the chat room for new events in a 
separate thread. Since the server now has to communicate every event to every 
client individually, the server becomes a bottleneck. To alleviate this problem, 
we try to minimize communication, by combining multiple events in a message 
sent to the clients. 
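
The polling scheme with combined events can be sketched in a few lines. The Python example below is a minimal, language-neutral illustration under stated assumptions: the classes `ChatRoom` and `Client`, the method names, and the use of an index into an event list are hypothetical and do not reproduce the LuChat classes, which are written in Java and communicate through RMI.

```python
import threading


class ChatRoom:
    """Server-side room that keeps an ordered log of events."""

    def __init__(self):
        self._events = []
        self._lock = threading.Lock()

    def post(self, event):
        with self._lock:
            self._events.append(event)

    def poll(self, next_index):
        """Return (batch, new_index): every event the caller has not seen yet.

        All pending events travel in one reply, which is the batching idea.
        """
        with self._lock:
            batch = self._events[next_index:]
            return batch, next_index + len(batch)


class Client:
    """Client that initiates all communication by polling the room."""

    def __init__(self, room):
        self.room = room
        self.next_index = 0
        self.seen = []

    def poll_once(self):
        batch, self.next_index = self.room.poll(self.next_index)
        self.seen.extend(batch)
        return len(batch)


room = ChatRoom()
client = Client(room)
for text in ("hi", "hello", "bye"):
    room.post(text)
print(client.poll_once(), client.seen)   # three events arrive in a single reply
```

In a real client, `poll_once` would be called repeatedly from the separate polling thread mentioned above.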

A more severe problem is the poor support for Java RMI in certain
web browsers. A number of alternative RMI implementations do exist.
KaRMI [8] was developed to be a more efficient alternative to Java RMI.
Together with its improved object serialization [7], KaRMI outperforms Java
RMI. NinjaRMI [6] was developed as part of the Ninja project and aimed to
improve the performance of Java RMI as well as to extend its capabilities.
NexusRMI [3] is an implementation of RMI built on top of Nexus Java [4], which allows
interoperability [2] with HPC++ [1]. Manta [10] implements a very efficient RMI
by compiling the Java code to native code. None of these alternatives facilitates
distribution of the RMI runtime with the applet code.

Therefore, we implemented our own (subset of) RMI, called LuchatRMI. The
LuchatRMI runtime system can be sent to the clients with the applet code,
increasing the download size of our chat system by 6.5 Kb. LuchatRMI is built
directly on top of Java TCP/IP sockets. The latency of a method invocation
without parameters and return value in LuchatRMI is 6.5 ms, against 1.9 ms for
standard Java RMI. In this case, each method invocation sets up a new
connection between stub and skeleton. If we change the stub and skeleton to reuse a
previous connection, the latency for LuchatRMI drops to 0.6 ms. Figure 1
compares the throughput for different data sizes. The size specifies the size of the
parameter as well as the size of the return value. Again, we see that reusing
connections significantly improves performance.

[Figure 1: throughput versus data size (bytes).]

Fig. 1. Throughput comparison
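
The difference between opening a connection per invocation and reusing one can be illustrated with plain TCP sockets. The Python sketch below is not LuchatRMI: the toy line-based protocol, the `start_skeleton` helper, and the `Stub` class are assumptions made for this example. It only shows where the extra connection set-up occurs.

```python
import socket
import threading


def start_skeleton():
    """Start a toy 'skeleton' on localhost that upper-cases each request line."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))            # port 0: let the OS pick a free port
    server.listen()
    stats = {"connections": 0}

    def serve(conn):
        with conn, conn.makefile("rb") as reader:
            for line in reader:              # one request per line
                conn.sendall(line.upper())   # reply on the same connection

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:                  # server socket was closed
                return
            stats["connections"] += 1
            threading.Thread(target=serve, args=(conn,), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1], stats


class Stub:
    """Toy 'stub': either connects for every invocation or reuses one connection."""

    def __init__(self, port, reuse):
        self.port, self.reuse = port, reuse
        self.sock = None
        self.reader = None

    def _connect(self):
        sock = socket.create_connection(("127.0.0.1", self.port))
        return sock, sock.makefile("rb")

    def invoke(self, text):
        if self.reuse:
            if self.sock is None:            # connect only on the first call
                self.sock, self.reader = self._connect()
            sock, reader = self.sock, self.reader
        else:
            sock, reader = self._connect()   # new connection for every call
        sock.sendall(text.encode() + b"\n")
        reply = reader.readline().decode().rstrip("\n")
        if not self.reuse:
            reader.close()
            sock.close()
        return reply


server, port, stats = start_skeleton()
per_call = Stub(port, reuse=False)
per_call.invoke("ping")
per_call.invoke("ping")
reusing = Stub(port, reuse=True)
reusing.invoke("ping")
reusing.invoke("ping")
print(stats["connections"])                  # connections accepted so far
```

The per-call stub pays the TCP connection set-up on every invocation, while the reusing stub pays it once, which is the effect behind the latency difference reported above.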



3 Performance 

We conducted our evaluation of the chat system performance on a cluster of
Pentium II workstations running Windows NT 4. The server runs on a
Pentium II machine with 128 Mb of memory, running Solaris 7 Intel Edition.
Clients and server are connected with a 10 Mbits/sec Ethernet. Unless indicated
otherwise in the text, we ran the clients in the Java Virtual Machine incorporated
in the Netscape 4 browser.

The performance is obtained by using robot chat applets that send ‘say’
events at random intervals (5-10 seconds, to reflect a normal interactive chatter).
We report the averages over an interval of 25 events. Each experiment was run
five times, in order to identify deviations in the measurements not caused by the
application. Each time, the average of five designated clients is reported. These
five clients each run on their own host. The rest of the clients, up to a maximum
of 75, are distributed among 15 host machines, whereby we made sure that all
machines were equally loaded.



3.1 Java RMI vs. LuChat RMI Performance 

Figure 2a compares the response times observed by LuChat clients under varying 
load using LuChat RMI and Java RMI. In this case, all clients are connected to 
the same room. 

With a low number of clients we obtain a 10 ms response time. As can be seen,
the response time under higher load is still such that it will hardly be noticed
(40 ms).

We see that Java RMI is a bit slower under light load (20 ms) and equals 
LuChat RMI performance under higher load. With a load of 80 clients, however, 
the implementation using Java RMI loses approximately 10% of the chat events, 
whereas the implementation using LuChat RMI loses less than 1% of the chat 
events.

[Figure 2 panels: (a) Response times; (b) Client CPU usage; (c) Server CPU usage;
(d) Network load.]

Fig. 2. Java RMI - LuChat RMI comparison



To assess what causes the difference in performance, we compare the CPU
usage at both the client and the server side. Figure 2b compares the CPU usage
on the client side for LuChat RMI and Java RMI.

As can be seen, the client CPU usage when using Java RMI is a factor of 1.5
higher than when using LuChat RMI. Also, on the server side, the CPU usage
under high load is higher when using Java RMI than when using LuChat RMI, as can
be seen from Figure 2c. Remember that the implementation using Java RMI has
a high event loss at 80 clients, which can also cause the server CPU usage to be
lower, since it does not broadcast all events.

We also measured the network load using the standard 'snoop' utility from
UNIX. This utility captures all incoming TCP packets and counts them.
Figure 2d compares the network load when using LuChat RMI and Java RMI.

We see that the network load when using Java RMI is lower for higher loads 
than the network load for LuChat RMI. These graphs, however, depict packets 
per second and not bytes per second. 

3.2 Scalability Issues

With both LuChat RMI and Java RMI, the performance and reliability of the chat
system degrade once more than approximately 75 clients are connected to one and
the same room. In this section, we will show experimental results obtained with
a similar load, but distributed differently. We will use LuChat RMI for these
experiments, since that is the most portable version.

In the next experiments, we again take the average of five designated clients
that are all connected to the same room. The additional clients, again distributed
over 15 machines, are now also distributed over 15 chat rooms (all running on
the same server machine). In the graphs, the results of this test are marked '16
rooms'. Next, we run the same test, but put all additional clients in one room
again, which is a different room from the one the initial five are connected to. The
results of this test are marked '2 rooms'. The original test from the previous
section is marked '1 room'.

Figure 3a compares the response times for the three different methods of load
distribution. As can be concluded, the response times do not increase when the
load in other rooms increases. This is a significant result, since it implies that,
although the number of clients connected to one particular room is restricted to
approximately 75, the complete chat system scales to many more clients, as long
as the load is distributed over multiple rooms. Furthermore, the results for '2
rooms' imply that a heavy load in one particular room does not influence the
performance of another room.

Looking at the client CPU usage depicted in Figure 3b, we see that, not
surprisingly, it does not increase when the number of messages processed by the
client does not increase. Figure 3c compares the CPU usage at the server. We see
that when the load is distributed over 16 rooms, the CPU usage increases less than
when a high load is imposed on one particular room. This can be explained by
looking at the total number of client requests that the server has to process.
In all cases, the same number of events is generated by the clients. However,
when the load is distributed over 16 rooms, each event has to be sent to only
five clients, instead of to all clients. As can be expected, this is also reflected by
the network load, as shown in Figure 3d.
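
The effect of the load distribution on the server's work can be illustrated with
a small back-of-the-envelope calculation. The sketch below is not part of the
LuChat measurements; it assumes that every client generates one event per
interval and that the server forwards each event to every client in the same
room, including the sender. The function name is hypothetical.

```python
def deliveries_per_interval(room_sizes):
    """Messages the server sends when each client emits one event per interval.

    Assumption (ours): an event is forwarded to every client in the sender's
    room, sender included, so a room with k clients costs k * k deliveries.
    """
    return sum(k * k for k in room_sizes)


# 5 designated clients plus 75 additional clients (80 in total).
one_room = deliveries_per_interval([80])           # '1 room'
two_rooms = deliveries_per_interval([5, 75])       # '2 rooms'
sixteen_rooms = deliveries_per_interval([5] * 16)  # '16 rooms'

print(one_room, two_rooms, sixteen_rooms)  # 6400 5650 400
```

Under these assumptions the '16 rooms' configuration requires sixteen times
fewer deliveries than '1 room' for the same number of generated events.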

Considering that in a real chat session a load in one room as high as the one
we used in these tests would make chatting clearly intractable, these results
are certainly encouraging.



3.3 Platform Comparison 

In this section, we will evaluate the Java performance on different operating 
system and Java virtual machine combinations. We will evaluate the performance 
of the chat system using our base setup where all clients are connected to one 
and the same room. Again, five clients, each running alone on its own host,
will provide the actual results. Only the version with LuChat RMI will be used,
since that is the most portable version. The numbers for Netscape on Windows 
are repeated here for comparison. 

Fig. 3. Different load configurations: (a) response times, (b) client CPU
usage, (c) server CPU usage, (d) network load



The first platform we will compare with is Internet Explorer 5 running on 
Windows NT. The next platform will be Linux, using the virtual machine that 
comes with Netscape 4. Linux is quickly gaining popularity as an alternative to
Microsoft Windows, so it is interesting to see how it performs. The last platform
will be Sun’s own Java virtual machine that comes with the JDK 1.2. From this 
distribution, we use the applet viewer to run our tests. 

The performance of the chat system on each of these platforms is depicted in 
Figure 4. As can be seen, Internet Explorer on Windows outperforms all other 
platforms. With Netscape on Linux we experience the highest response times, 
although the performance would still be acceptable for real world usage. Sun’s 
appletviewer loses a significant number of events (15%-20%) with a load of 65
clients, which explains why the observed response times improve. Because of the 
high message loss percentage, we did not bother measuring beyond 65 clients. 

The amount of CPU usage at the client side could explain the difference in
response times. The CPU usage on Netscape / Linux is twice as high as the CPU
usage on Netscape / Windows. Internet Explorer is the least CPU demanding
platform, which could explain its good performance.

Fig. 4. Platform comparison

The difference in CPU usage at the server could be caused by clients not
being able to keep pace. When the server detects a slow client, it will allow it to
catch up by not providing all events, which also means less work for the server.
Note that the difference becomes significant only under high load. The same goes
for network usage: fewer events sent to clients means less network usage.

4 Conclusions 

In this article, we introduced our LuChat system, which is a specific instance of a
collaborative world wide web based application. Since these types of applications
are becoming more important, we believe it is useful to assess the overall
performance of such a system.

We also introduced the LuChat RMI system, which we developed to broaden
browser support. The main attraction of our LuChat RMI system is that its
runtime can be transmitted along with the application and therefore requires
no installation on the client side. This makes it an attractive alternative RMI
system for web based applications that need to run on a broad range of web
browsers. The system is not a complete RMI implementation, but it does support
basic RMI applications.

The performance of our LuChat RMI system is competitive with the RMI
implementation distributed with Netscape. The performance of the overall system,
at least on a Local Area Network, is acceptable. We still need to assess its
performance in real usage. We did, however, use the system with a very small number
of users divided between the USA and the Netherlands, and there the performance
was acceptable.



Abstract. In this paper, we investigate the multi-node broadcasting
problem in an all-ported 3-D wormhole-routed torus. The main technique
used in this paper is based on a proposed aggregation-then-distribution
strategy. Extensive simulations are conducted to evaluate the multi-node
broadcasting algorithm.



1 Introduction 

A massively parallel computer (MPC) consists of a large number of identical
processing elements interconnected by a network. One basic communication
operation in such a machine is broadcasting. Two commonly discussed instances
are one-to-all broadcast and all-to-all broadcast, where one or all nodes need to
broadcast messages to the rest of the nodes [1]. A more complicated instance is
the many-to-all (or multi-node) broadcast, where an unknown number of nodes
located in unknown positions each intend to perform a broadcast operation.

Saad and Schultz [3] [4] initially defined this problem and proposed a simple
routing algorithm for hypercubes. A distributed approach to improve the load
imbalance problem was presented by Tseng [6] for hypercubes and star graphs.
However, these approaches only attempt to reduce node contention; they do not
give a congestion-free result.

Fig. 1. (a) An example of DDNs, (b) DCN in a 3D torus.

This paper addresses the multi-node broadcasting problem in wormhole-routed
3-D tori. Our approach is based on a proposed aggregation-then-distribution
strategy. The major contribution of this paper is to show how to develop
multi-node broadcasting using the aggregation-then-distribution strategy in
wormhole-routed 3-D tori. Given a multi-node broadcast problem with an unknown
number s of source nodes located at unknown positions in a torus, each intending
to broadcast an m-byte message, our approach can solve it efficiently in time
0(max( [logy n] , / i)T 5 -|- maxd'logy \f~\~\j^m,nm)Tc), where h is the number of 
independent subnetworks. It is shown that this result outperforms the
congestion-free scheme using edge-disjoint spanning trees.

2 Basic Idea 

In this paper, we consider G as a 3-D torus $T_{n_1 \times n_2 \times n_3}$ with
$n_1 \times n_2 \times n_3$ nodes. In a 3-D torus, each node is denoted as
$P_{i_1,i_2,i_3}$, $1 \le i_1 \le n_1$, $1 \le i_2 \le n_2$, $1 \le i_3 \le n_3$,
and has an edge to $P_{(i_1 \pm 1) \bmod n_1,\, i_2,\, i_3}$ along dimension one,
an edge to $P_{i_1,\, (i_2 \pm 1) \bmod n_2,\, i_3}$ along dimension two, and an
edge to $P_{i_1,\, i_2,\, (i_3 \pm 1) \bmod n_3}$ along dimension three. The
wormhole routing model is assumed [2]. Under such a model, the time required to
deliver a packet of L bytes from a source node to a destination node can be
formulated as $T_s + L T_c$, where $T_s$ is the start-up time and $T_c$
represents the transmission time per byte. In addition, we adopt the all-port
model and dimension-ordered routing [6].
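
As an illustration of the model above, the following sketch lists the six
neighbors of a torus node and evaluates the latency formula $T_s + L T_c$. It is
only a sketch under stated assumptions: coordinates are 0-based here (so that
wrap-around is a plain modulo), the function names are hypothetical, and the
2000-byte packet size is an arbitrary example value. The values 30 and 1
(microseconds) are the simulation parameters given in Section 4.

```python
def torus_neighbors(node, dims):
    """Return the six neighbors of `node` in a 3-D torus of size `dims`.

    Assumption: 0-based coordinates, so the wrap-around is a plain modulo.
    """
    neighbors = []
    for axis in range(3):
        for step in (1, -1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % dims[axis]
            neighbors.append(tuple(coords))
    return neighbors


def wormhole_latency(length_bytes, t_s, t_c):
    """Time to deliver a packet of `length_bytes` bytes: T_s + L * T_c."""
    return t_s + length_bytes * t_c


print(torus_neighbors((0, 0, 0), (4, 4, 4)))
# [(1, 0, 0), (3, 0, 0), (0, 1, 0), (0, 3, 0), (0, 0, 1), (0, 0, 3)]
print(wormhole_latency(2000, t_s=30, t_c=1))  # 2030 (microseconds)
```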

2.1 Network Partitioning Scheme on 3-D Torus

Consider a 3-D torus $T_{n_1 \times n_2 \times n_3}$. Suppose that there exists
an integer h such that $n_1$, $n_2$, and $n_3$ are divisible by h. We define
$h \times h$ data-distribution networks
$DDN_{u,v} = (V_{u,v}, C_{u,v}) = DDN_i$, $u, v = 0..h-1$ and $i = u \cdot h + v$,
as follows:

$V_{u,v} = \{P_{i,j,l} \mid i = ah + ((u + v) \bmod h),\ j = bh + v,\ l = ch + u,$
for all $a = 0..\lceil n_1/h \rceil - 1$, $b = 0..\lceil n_2/h \rceil - 1$,
$c = 0..\lceil n_3/h \rceil - 1\}$

$C_{u,v} = \{$all channels at x-axis $ah + u + v$, y-axis $bh + v$ and z-axis
$ch + u\}$

Each DDN is a dilation-h 3-D torus of size
$\lceil n_1/h \rceil \times \lceil n_2/h \rceil \times \lceil n_3/h \rceil$, such
that each edge is dilated by a path of h edges. Fig. 1 illustrates an example
where the black nodes denote a DDN and the gray zone represents a DCN. The 3-D
torus $T_{n_1 \times n_2 \times n_3}$ is partitioned into data collecting
networks $DCN_{a,b,c} = (V_{a,b,c}, C_{a,b,c})$,
$a = 0..\lceil n_1/h \rceil - 1$, $b = 0..\lceil n_2/h \rceil - 1$,
$c = 0..\lceil n_3/h \rceil - 1$, as follows:

$V_{a,b,c} = \{P_{i,j,l} \mid i = ah + x,\ j = bh + y,\ l = ch + z,$ for all
$x, y, z = 0..h-1\}$

$C_{a,b,c} = \{$the set of edges induced by $V_{a,b,c}$ in
$T_{n_1 \times n_2 \times n_3}\}$

Fig. 2. An example of (a) routing matrix and (b) collecting routing matrix.
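
The two partitions can be made concrete with a short membership test. The
sketch below assumes 0-based coordinates and a cubic torus whose side n is
divisible by h; the function names are hypothetical. It follows the set
definitions above: a node $P_{i,j,l}$ lies in the DCN of the block that contains
it, and it lies in $DDN_{u,v}$ only if $i \bmod h = (u + v) \bmod h$ with
$v = j \bmod h$ and $u = l \bmod h$.

```python
from collections import Counter


def dcn_index(node, h):
    """(a, b, c) of the h x h x h block (DCN) that contains `node`."""
    i, j, l = node
    return (i // h, j // h, l // h)


def ddn_index(node, h):
    """(u, v) of the DDN that contains `node`, or None if it is in no DDN."""
    i, j, l = node
    u, v = l % h, j % h
    return (u, v) if i % h == (u + v) % h else None


n, h = 4, 2
nodes = [(i, j, l) for i in range(n) for j in range(n) for l in range(n)]
ddn_nodes = [p for p in nodes if ddn_index(p, h) is not None]

# Each of the h*h DDNs has (n/h)**3 nodes, and every DCN holds exactly one
# node of each DDN.
per_ddn = Counter(ddn_index(p, h) for p in ddn_nodes)
per_dcn = Counter(dcn_index(p, h) for p in ddn_nodes)
print(len(ddn_nodes), sorted(per_ddn.values()), sorted(set(per_dcn.values())))
# 32 [8, 8, 8, 8] [4]
```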

2.2 Algebraic Notation 

In the following, we adopt an algebraic notation to represent our routing
algorithm. The torus of size n is an undirected graph. Each node is denoted as
$P_{x_1, x_2, \ldots, x_k}$, $0 \le x_i < n$, $1 \le i \le k$. Our routing
algorithm is based on the concept of "span of vector spaces" in linear algebra.
For convenience, the i-th positive (resp., negative) elementary vector of $Z^k$
is denoted as $e_i$ (resp., $e_{-i}$), $i = 1..k$. We may rewrite a sum of
elementary vectors $e_{i_1} + e_{i_2} + \cdots + e_{i_m}$ as
$e_{i_1, i_2, \ldots, i_m}$. For instance, $e_{1,3} = e_1 + e_3$ and
$e_{1,-3} = e_1 - e_3$.

Lemma 1 In $Z^k$, given a node x, a q-tuple of vectors
$B = (b_1, b_2, \ldots, b_q)$, and a q-tuple of integers
$N = (n_1, n_2, \ldots, n_q)$, the span of x by vectors B and distances N is
defined as

$$SPAN(x, B, N) = \{\, x + \sum_{i=1}^{q} a_i b_i \mid 0 \le a_i < n_i \,\}.$$
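
To make the SPAN notation concrete, the sketch below enumerates the node set
$\{x + \sum_i a_i b_i \mid 0 \le a_i < n_i\}$ for a given start node x, vector
tuple B and distance tuple N. The `modulus` argument, which wraps coordinates
around the torus, and the function name are our own assumptions.

```python
from itertools import product


def span(x, B, N, modulus=None):
    """Enumerate SPAN(x, B, N) as a set of coordinate tuples.

    `modulus` (an assumption of this sketch) wraps coordinates around a torus
    with that many nodes per dimension.
    """
    nodes = set()
    for coeffs in product(*(range(n) for n in N)):
        point = list(x)
        for a, b in zip(coeffs, B):
            for d in range(len(point)):
                point[d] += a * b[d]
        if modulus is not None:
            point = [c % modulus for c in point]
        nodes.add(tuple(point))
    return nodes


e1, e2, e3 = (1, 0, 0), (0, 1, 0), (0, 0, 1)
e13, e12 = (1, 0, 1), (1, 1, 0)  # e_{1,3} = e_1 + e_3, e_{1,2} = e_1 + e_2

print(len(span((0, 0, 0), (e1, e2, e3), (2, 2, 2))))         # 8
print(len(span((0, 0, 0), (e13, e12), (3, 3), modulus=3)))   # 9
```

The second call enumerates a plane spanned by $e_{1,3}$ and $e_{1,2}$, the kind
of diagonal plane used by the algorithm in Section 3.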

A 3-D torus is viewed as $SPAN(P_{0,0,0}, (e_1, e_2, e_3), (n, n, n))$. We
introduce two matrices: delivery routing and distance matrices. A delivery
routing matrix $R = [r_{i,j}]_{3 \times 3}$ is a matrix with entries -1, 0, 1
such that each row indicates a message delivery; if $r_{i,j} = 1$ (resp. -1),
the corresponding message will travel along the positive (resp. negative)
direction of dimension j; if $r_{i,j} = 0$, the message will not travel along
dimension j. A distance matrix $D = [d_{i,j}]_{3 \times 3}$ is an integer
diagonal matrix (all non-diagonal elements are 0); $d_{i,i}$ represents the
distance to be traveled by the i-th message described in R along each dimension.

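For instance, the six message deliveries in Fig. 2(a) have three directions and
thus can be represented by a delivery routing matrix:

$$R = \begin{bmatrix} e_{1,3} \\ e_{2,3} \\ e_3 \end{bmatrix}
    = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}$$

In general, the node $P_{i,j,k}$ sends M to six nodes
$P_{i+\alpha_1,\, j,\, k+\alpha_1}$, $P_{i,\, j+\alpha_2,\, k+\alpha_2}$,
$P_{i,\, j,\, k+\alpha_3}$, $P_{i+\alpha_{-1},\, j,\, k+\alpha_{-1}}$,
$P_{i,\, j+\alpha_{-2},\, k+\alpha_{-2}}$, $P_{i,\, j,\, k+\alpha_{-3}}$ (here t
is the block size; see Section 3 for details on deriving t). So we can use two
distance matrices:

$$D^+ = \begin{bmatrix} \alpha_1 & 0 & 0 \\ 0 & \alpha_2 & 0 \\ 0 & 0 & \alpha_3 \end{bmatrix}
\quad \text{and} \quad
D^- = \begin{bmatrix} \alpha_{-1} & 0 & 0 \\ 0 & \alpha_{-2} & 0 \\ 0 & 0 & \alpha_{-3} \end{bmatrix}$$

and the six message deliveries in Fig. 2(a) are represented by the matrix
multiplications:

$$D^+ \times R = \begin{bmatrix} \alpha_1 & 0 & \alpha_1 \\ 0 & \alpha_2 & \alpha_2 \\ 0 & 0 & \alpha_3 \end{bmatrix}
\quad \text{and} \quad
D^- \times R = \begin{bmatrix} \alpha_{-1} & 0 & \alpha_{-1} \\ 0 & \alpha_{-2} & \alpha_{-2} \\ 0 & 0 & \alpha_{-3} \end{bmatrix}$$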
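
The product of a distance matrix and a delivery routing matrix can be read as a
list of coordinate offsets, one per row. The following sketch computes these
offsets and the resulting destination nodes for the routing matrix with rows
$e_{1,3}$, $e_{2,3}$, $e_3$. The concrete distances (1, 2, 3 and their
negatives), the torus size 16 and the function names are illustrative
assumptions only.

```python
def scale_rows(distances, routing):
    """Compute diag(distances) x routing: row i of `routing` scaled by distances[i]."""
    return [[d * r for r in row] for d, row in zip(distances, routing)]


def destinations(node, distances, routing, n):
    """Nodes reached from `node` by the deliveries in diag(distances) x routing."""
    return [
        tuple((c + o) % n for c, o in zip(node, offset))
        for offset in scale_rows(distances, routing)
    ]


R = [[1, 0, 1],   # e_{1,3}
     [0, 1, 1],   # e_{2,3}
     [0, 0, 1]]   # e_3

alpha_plus = (1, 2, 3)      # diagonal of D+ (assumed example values)
alpha_minus = (-1, -2, -3)  # diagonal of D- (assumed example values)

print(destinations((0, 0, 0), alpha_plus, R, n=16))
# [(1, 0, 1), (0, 2, 2), (0, 0, 3)]
print(destinations((0, 0, 0), alpha_minus, R, n=16))
# [(15, 0, 15), (0, 14, 14), (0, 0, 13)]
```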



Further, we define a similar routing matrix, namely the collecting routing
matrix C. A collecting routing matrix $C = [c_{i,j}]_{3 \times 3}$ is a matrix
with entries -1, 0, 1 such that each row indicates a path of a collected
message; if $c_{i,j} = 1$ (resp. -1), the corresponding message will be collected
from the neighboring node along the positive (resp. negative) direction of
dimension j; if $c_{i,j} = 0$, the message will not be collected from a neighbor
along dimension j. For instance, as shown in Fig. 2(b), given a collecting
routing matrix

$$C = \begin{bmatrix} e_{1,3} \\ e_{2,3} \\ e_3 \end{bmatrix}
    = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}$$

the matrix multiplications are

$$D^+ \times C = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}
\quad \text{and} \quad
D^- \times C = \begin{bmatrix} -1 & 0 & -1 \\ 0 & -1 & -1 \\ 0 & 0 & -1 \end{bmatrix}$$



3 Multi-Node Broadcasting in 3-D Torus 

3.1 Aggregation Phase 

Step 1: Diagonal-Based Data-Aggregation Operation The main function
of the data-aggregation operation is to regularize the communication pattern
before the multi-node broadcasting. Let a 3-D torus be represented as
$SPAN(P_{0,0,0}, (e_{1,3}, e_{1,2}, e_1), (n, n, n))$. Each DCN is viewed as
$SPAN(P_{x,y,z}, (e_{1,3}, e_{1,2}, e_1), (h, h, h))$, where
$0 \le x = ih,\ y = jh,\ z = kh < n$. The data-aggregation operation aggregates
all possible messages to a special plane in each DCN, called the diagonal plane
and represented by $SPAN(P_{x,y,z}, (e_{1,3}, e_{1,2}), (h, h))$. All nodes
aggregate their messages into this diagonal plane. This operation is
represented by

$$C = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},\quad
D^+ = \begin{bmatrix} t & 0 & 0 \\ 0 & 2t & 0 \\ 0 & 0 & 3t \end{bmatrix},\quad
D^- = \begin{bmatrix} -t & 0 & 0 \\ 0 & -2t & 0 \\ 0 & 0 & -3t \end{bmatrix}$$

and

$$D^+ \times C = \begin{bmatrix} t & 0 & 0 \\ 0 & 2t & 0 \\ 0 & 0 & 3t \end{bmatrix}
\quad \text{and} \quad
D^- \times C = \begin{bmatrix} -t & 0 & 0 \\ 0 & -2t & 0 \\ 0 & 0 & -3t \end{bmatrix}$$



Lemma 2 Diagonal-based data-aggregation operation can be recursively performed

on a T^xnxn in [log^ h']Ts + = [log^ h']Ts + ~i mTc- 

i=0 



Step 2: Balancing-Load Operation After applying the data-aggregation
operation, $DDN_0$, $DDN_1$, ..., and $DDN_{h^2-1}$ each hold a different number
of messages. To remove this load imbalance, a data tuning procedure is
presented. The operation is divided into prefix-sum and data-tuning procedures.

Prefix-Sum Procedure: After the data-aggregation operation, all source nodes'
messages are aggregated to regular positions, which lie in the diagonal planes
$SPAN(P_{x,y,z}, (e_{1,3}, e_{1,2}), (h, h))$, where
$0 \le x = ih,\ y = jh,\ z = kh < n$. All those planes constitute a special cube
$SPAN(P_{0,0,0}, (e_{1,3}, e_{1,2}, e_1), (n, n, \lceil n/h \rceil))$. Our
diagonal-based recursive prefix-sum procedure calculates the prefix-sum value
for each message-keeping node in this cube. The diagonal-based prefix-sum
procedure is divided into forward and backward stages. In the forward stage,
the information on the number of messages is aggregated from the cube to a
plane, from the plane to a line, and from the line to one node. After the
forward stage, the total number of source messages is kept in one node. In the
backward stage, partial prefix-sum values are returned from the node to a line,
from the line to a plane, and from the plane to the cube. We omit the detailed
operations here since the work is straightforward.
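
Since the details are omitted, the following sequential sketch only illustrates
the two-stage idea; it is not the paper's distributed procedure. It assumes that
the message counts are stored in a nested list `cube[p][r][c]`, that nodes are
ordered plane by plane and line by line, and that an exclusive prefix sum is
wanted. All names are hypothetical.

```python
def hierarchical_prefix_sum(cube):
    """Exclusive prefix sums of message counts stored as cube[p][r][c]."""
    # Forward stage: aggregate counts from the cube to planes, from planes
    # to lines, and finally to a single total.
    line_sums = [[sum(line) for line in plane] for plane in cube]
    plane_sums = [sum(sums) for sums in line_sums]
    total = sum(plane_sums)

    # Backward stage: hand partial prefix sums back down, from the total to
    # planes, from planes to lines, and from lines to the individual nodes.
    result = []
    plane_offset = 0
    for p, plane in enumerate(cube):
        line_offset = plane_offset
        out_plane = []
        for r, line in enumerate(plane):
            node_offset = line_offset
            out_line = []
            for count in line:
                out_line.append(node_offset)
                node_offset += count
            out_plane.append(out_line)
            line_offset += line_sums[p][r]
        result.append(out_plane)
        plane_offset += plane_sums[p]
    return result, total


counts = [[[1, 0], [2, 1]], [[0, 3], [1, 1]]]
prefix, total = hierarchical_prefix_sum(counts)
print(prefix)  # [[[0, 1], [1, 3]], [[4, 4], [7, 8]]]
print(total)   # 9
```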

Data Tuning Procedure: Assume that a node x is located in $DDN_{i,j}$ and holds
a destination list. The destination list indicates to which neighboring nodes
node x should move its messages: for node x, if $(k, l) \in$ destination list,
one message from $DDN_{i,j}$ (node x) is moved to $DDN_{k,l}$. Every node x
performs the following operations in parallel.
(1) Finding a destination list: Given a prefix-sum value $\alpha$ and a number
of kept messages $\beta$, the destination list is
$F = \{\alpha \bmod h^2, (\alpha + 1) \bmod h^2, \ldots, (\alpha + \beta) \bmod h^2\}$
if the number of DDNs is $h^2$. Two communication steps are needed to move data
from $DDN_{i,j}$ to $DDN_{k,l}$. Note that $F'$ is a sequence of pairs which is
constructed as follows. For every $t \in F$, let
$(i = t \bmod h,\ j = \lfloor t/h \rfloor) \in F'$, where i, j indicate the
offset values of the row and column tuning actions in the data tuning
operation. (2) Data tuning operation: The data tuning operation is divided into
row tuning and column tuning actions, which are formally described below.

T1: Row tuning action ($DDN_{i,j} \to DDN_{k,j}$): An extra alignment operation
is executed due to the dimension-order routing. If $|i - k| \le 3$, then we
allow $DDN_{i,j} \to DDN_{i \pm 1,j}$, $DDN_{i \pm 2,j}$, and $DDN_{i \pm 3,j}$
within two communication steps. For each node in the diagonal plane of
$DDN_{i,j}$, we first align $DDN_{i \pm 1,j}$ along dimension X with distance
$\pm 1$, $DDN_{i \pm 2,j}$ along dimension Y with distance $\pm 2$, and
$DDN_{i \pm 3,j}$ along dimension Z with distance $\pm 3$ to six meta-nodes.
Every node $P_{x,y,z}$ in the diagonal plane of $DDN_{i,j}$ distributes its
messages to six nodes $P_{x-1,y-1,z}$, $P_{x+1,y+1,z}$, $P_{x,y-2,z}$,
$P_{x,y+2,z}$, $P_{x,y,z-3}$, and $P_{x,y,z+3}$, which is represented by

$$R = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{bmatrix},
\quad \text{where} \quad
D^+ = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\quad \text{and} \quad
D^- = \begin{bmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{bmatrix}$$

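T2: Column tuning action ($DDN_{k,j} \to DDN_{k,l}$): This action can be
represented by

$$R = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{bmatrix},
\quad \text{where} \quad
D^+ = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\quad \text{and} \quad
D^- = \begin{bmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{bmatrix}$$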
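
Returning to step (1) of the data tuning procedure, the construction of the
destination list F and of the pair sequence F' can be sketched as follows. The
sketch follows the range $\alpha, \alpha + 1, \ldots, \alpha + \beta$ as stated
in the text and assumes that t/h denotes integer division; the function names
and the example values are hypothetical.

```python
def destination_list(alpha, beta, h):
    """F = [alpha mod h^2, (alpha + 1) mod h^2, ..., (alpha + beta) mod h^2]."""
    return [(alpha + k) % (h * h) for k in range(beta + 1)]


def tuning_offsets(F, h):
    """F' = [(t mod h, t // h) for t in F]: row and column tuning offsets.

    Assumption: t/h in the text is read as integer division.
    """
    return [(t % h, t // h) for t in F]


F = destination_list(alpha=7, beta=3, h=3)
print(F)                     # [7, 8, 0, 1]
print(tuning_offsets(F, 3))  # [(1, 2), (2, 2), (0, 0), (1, 0)]
```

Because consecutive prefix-sum values map to consecutive DDN indices modulo
$h^2$, the messages end up spread over the DDNs in round-robin fashion.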



3.2 Distribution Phase 

Step 1: Alignment Operation (1) Alignment to diagonal plane: All possible
messages are aligned into the diagonal plane. This task can be easily achieved
by performing the diagonal-based data aggregation operation as introduced in
Section 3.1, which takes time riog 7 ([^])](Ts + mTc), where to = (2) All- 

to-all broadcasting procedure on diagonal plane: This procedure is to 
collect the messages of each node in the diagonal plane
$SPAN(P_{x,y,z}, (e_{1,3}, e_{1,2}), (\lceil n/h \rceil, \lceil n/h \rceil))$
from the other nodes located in the same diagonal plane. The plane can be viewed
as rows or columns. Two broadcasting operations are needed. Basically, this
work consists of the row and column tuning operations with different distance
matrices $D^+$ and $D^-$, and is otherwise the same as the row tuning (T1)
operation and the column tuning (T2) operation.

Step 2: Broadcast Operation Now every node in the diagonal plane of each
DDN contains the same broadcast messages. The next step is to apply a
well-known result, the diagonal broadcast scheme in a 3-D torus [7], on each DDN
in parallel. The diagonal plane
$SPAN(P_{x,y,z}, (e_{1,3}, e_{1,2}), (\lceil n/h \rceil, \lceil n/h \rceil))$
has partial source messages, and the broadcasting is based on recursively
sending messages from a diagonal plane to six planes. We mention that the
operation is executed
in time [logy (T^ + mTc), where to = 



Step 3: Data Collection Operation In each data collecting network (which is
an h x h x h mesh), each diagonal plane has received messages
$M_0, M_1, \ldots, M_{h^2-1}$. Each received message contains the whole messages
of one DDN. These messages should be propagated to every node of the DCN. This
is implemented in three stages: row broadcasting followed by column and
horizontal broadcasting. (1) In the row broadcasting stage, we use a recursive
scheme. Nodes located in the diagonal plane send messages to two nodes at
distance $\pm h/3$ and recursively propagate the messages. This takes
$\lceil \log_3 h \rceil$ communication phases. (2) In the horizontal
broadcasting, every node collects the partial messages from the row
broadcasting stage. The messages belong to its column nodes; every node
concurrently sends separate messages to the other nodes with a pipelined
scheme. A logical (directed) ring is embedded on each column of the DCN. This
gives a dilation-2 embedding. With this embedding, every node then pipelines
its own message along the ring. Finally, we have the following result.

Theorem 1 The multi-node broadcasting algorithm with the
aggregation-then-distribution strategy can be done in a
$T_{n \times n \times n}$ torus within

0(max( [logy n],/i)T^ -I- max(|'log 7 ^m,sm)Tc). 



4 Performance Comparison 

We mainly compared our scheme against the multiple-spanning-tree scheme [5]
under various situations. The parameters used in our simulations are listed
below: (1) the torus size is 16 x 16 x 16, (2) the startup time is
$T_s = 30\,\mu$sec and $T_c = 1\,\mu$sec, (3) the dilation is h = 7 or 14,
(4) the message size ranges from 2k to 10k. Below, we show our simulation
results from several perspectives. (A) Effects of Number of Sources: Fig. 3
shows the multi-node broadcast latency when $T_s = 30\,\mu$sec and
$T_c = 1\,\mu$sec for various numbers of sources. Our scheme with h = 7 incurs
higher latency than the multiple-spanning-tree scheme, while our scheme with
h = 14 has lower latency than the multiple-spanning-tree scheme. This indicates
that, with a sufficiently large h, our scheme performs better than the
multiple-spanning-tree scheme for various numbers of sources. (B) Effects of h:
The value of h reflects the number of subnetworks, and thus the level of
communication parallelism. So a larger h generally delivers better performance.
Fig. 3 also compares the multi-node broadcast latency when h = 7 and 14.
Observe that our scheme has lower latency when h = 14 than when h = 7. This
verifies that the higher the level of communication parallelism, the better the
performance.

Fig. 3. Multi-Node Broadcast latency in a 16 x 16 x 16 torus at various number of
source nodes.



Abstract. Previous research has pointed out the influence of adaptive
routing on the performance improvement of interconnection networks
for clusters of workstations. One of the design issues of adaptive routing
algorithms is the selection function, which selects the output channel
among all the available choices. In this paper we analyze in detail several
selection functions in order to evaluate their influence on network
performance. Simulation results show that network throughput may be
increased by up to 10%. When the network is close to saturation, improvements
in latency of up to 40% may be achieved.



1 Introduction 

Networks of workstations (NOWs) are usually interconnected by an irregular 
topology, which makes routing and deadlock avoidance quite complicated. Deadlocks
can be avoided by removing cyclic dependencies between channels [6]. As a
consequence, many messages are routed following non-minimal paths, therefore 
increasing message latency and wasting resources. A more efficient approach con- 
sists of allowing the existence of cyclic dependencies between channels while pro- 
viding some escape paths (as dedicated virtual channels) to avoid deadlock [3,7]. 

Virtual channels are not only useful for designing deadlock-free routing
algorithms. They may also be used in wormhole networks to increase link
utilization [2]. In addition, virtual channels enable the use of adaptive
routing algorithms. Adaptivity makes it possible to have several outgoing ports
for each destination, so a selection among all the feasible outgoing ports has
to be performed. Therefore, we may divide the task of routing a message into
two different phases [3]. In the first one, a routing algorithm provides a set of
suitable outgoing ports to reach the message destination. In the second phase,
one of the outgoing ports provided is selected according to some criterion. This
second phase is performed by the selection function.

Previous work [7] has pointed out the great influence that routing algorithms
have on network performance. However, the influence of selection functions on
performance has not been analyzed. In this paper, we take up this challenge,
evaluating the performance of different selection functions specially designed to
be used in conjunction with the adaptive routing algorithm proposed in [7].

The rest of the paper is organized as follows. Section 2 provides some
background on routing in networks with irregular topology. In Section 3 we
describe the selection functions that are later evaluated in Section 4. Finally,
some conclusions are drawn.

2 Routing in NOWs 

Several deadlock-free routing schemes have been proposed for irregular
networks [6, 1, 5, 7]. Our analysis is centered on the Minimal Adaptive (MA) routing
scheme [7]. The MA routing algorithm splits each physical channel into two vir- 
tual channels, called “original” and “new” channels, respectively. Original chan- 
nels are used following the up*/down* routing algorithm used in Autonet [6]. 
A newly injected message can only leave the source switch using new channels 
belonging to minimal paths. When a message arrives at a switch through a new 
channel, the routing function returns a new channel belonging to a minimal 
path, if available. If all of them are busy, then the up*/down* routing algorithm 
is used, selecting an original channel belonging to a minimal path or to the 
shortest path if a minimal path is not available. To ensure deadlock freedom, 
once a message reserves an original channel, it will be routed using only original 
channels according to the up*/down* routing function until delivered. 

Note that links may be split into more than two virtual channels. In this 
case, all of them except one would be used as new channels, while the other one 
would be the original channel. This increases adaptivity and also the number of 
messages that follow minimal paths. 

3 Selection Functions 

As mentioned above, two different phases occur when routing a message: the
routing function and the selection function. The selection function selects one
virtual channel from the set provided by the routing function. Usually, this
selection takes into account the state (free or busy) of the output virtual
channels. In this case, the selection function has two input parameters: (i) the
set of virtual channels provided by the routing algorithm as suitable outgoing
switch ports for the message and (ii) the set of free output virtual channels.
Using this information, the selection function chooses the best outgoing port for
the message, according to some criterion. It can also decide to block the message
until a better choice can be made.

In the MA routing algorithm (see Section 2), since new channels offer more 
freedom to route messages, selection functions should assign more priority to 
new channels than to original channels. However, several possibilities still exist, 
as we propose next. 

The Static Priority (SP) selection function cyclically distributes priorities
among the virtual channels of different physical channels. Figure 1 shows these
priorities for a switch with two links and three virtual channels per link. Virtual
channel identifiers are inside the circle, next to the border. Big numbers inside
the circle refer to physical channel identifiers. Numbers outside the switch
indicate the priority assigned to that virtual channel. As can be seen, the lowest
priorities are assigned to original channels, so new channels always have higher
priority than original ones. The aim of this selection function is to balance
virtual channel multiplexing of physical channels, while being easy to implement
in hardware.

Fig. 1. Virtual channel priorities for the SP selection function (legend:
original channel, new channel)

The least recently used (LRU) selection function returns the least recently
used new channel that is free, if any. If no new virtual channel is free, then it
selects the least recently used free original channel. To implement this function,
it is necessary to use a register of log2(N * M) bits per virtual channel, where
N is the number of physical links and M is the number of virtual channels
per physical link. Initially, this register is set to zero. When a message releases
a virtual channel, all registers whose value is less than or equal to the value of
the released virtual channel's register increase their value by one, and the
register of the released virtual channel is set to zero. In this way, the higher
the register value of a virtual channel, the higher the priority it is assigned.
If several registers have the same value, a static priority is used (the same as SP).
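
The register update rule of the LRU selection function can be sketched in
software as follows. This is only an illustrative model, not a hardware
description: the class and method names are hypothetical, channels are plain
labels, and the static tie-break order is assumed to be the order in which the
channels are listed.

```python
class LruSelector:
    """Software model of the LRU selection function (illustrative only)."""

    def __init__(self, new_channels, original_channels):
        # Assumed static (SP-like) tie-break order: position in these lists.
        self.new = list(new_channels)
        self.original = list(original_channels)
        self.register = {ch: 0 for ch in self.new + self.original}

    def release(self, channel):
        """Update the registers when a message releases `channel`."""
        value = self.register[channel]
        for ch in self.register:
            if self.register[ch] <= value:
                self.register[ch] += 1
        self.register[channel] = 0

    def select(self, feasible, free):
        """Pick the least recently used free new channel, else original."""
        for group in (self.new, self.original):
            candidates = [ch for ch in group if ch in feasible and ch in free]
            if candidates:
                # max() keeps the first maximal element, so ties fall back
                # to the static order.
                return max(candidates, key=lambda ch: self.register[ch])
        return None


selector = LruSelector(new_channels=["a", "b", "c"], original_channels=["o"])
selector.release("a")
selector.release("b")
everything = {"a", "b", "c", "o"}
print(selector.register)                          # {'a': 1, 'b': 0, 'c': 2, 'o': 2}
print(selector.select(everything, everything))    # c
print(selector.select(everything, {"a", "b", "o"}))  # a
print(selector.select(everything, {"o"}))         # o
```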

The least frequently used (LFU) selection function returns the least
frequently used new virtual channel from the set of free feasible output virtual
channels. If there is no feasible new channel, then the selection function returns
the least frequently used original virtual channel. In order to implement this
selection function, it is necessary to log the whole history of the channel in
order to calculate channel utilization, which requires a register per virtual
channel to store the activity information. This register would have to be
extremely large if we tried to calculate channel utilization accurately,
noticeably increasing hardware complexity. In order to reduce register size, we
may calculate the channel utilization only for the last N clock cycles. In this
case, we need an N-bit shift register and also an additional counter of log2(N)
bits per virtual channel. Initially, both the register and the counter are set to
zero. The register is shifted every cycle. If the virtual channel transmits a flit
in that clock cycle, a one is inserted into the shift register, and the counter is
increased by one. In case the link remains idle during that clock cycle, a zero is
inserted into the shift register and the counter is not increased. In addition,
whenever a one is shifted out of the register, the counter is decremented by
one. Hence, the counter value is the number of ones in the shift register, and
thus the channel utilization in the last N clock cycles. Using this information,
the channel with the lowest counter value takes the highest priority.
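
The windowed utilization counter described above can be modelled with a
fixed-length queue. The sketch below is a software illustration only; the class
and function names are hypothetical, and the window length of four cycles is an
arbitrary example value.

```python
from collections import deque


class UtilizationWindow:
    """Model of the N-bit shift register plus counter of one virtual channel."""

    def __init__(self, window):
        self.bits = deque([0] * window, maxlen=window)
        self.count = 0  # number of ones currently in the shift register

    def tick(self, transmitted):
        """Advance one clock cycle; `transmitted` tells if a flit was sent."""
        shifted_out = self.bits[0]
        bit = 1 if transmitted else 0
        self.bits.append(bit)  # maxlen drops the oldest bit automatically
        self.count += bit - shifted_out


def least_frequently_used(windows, candidates):
    """Return the candidate channel with the lowest utilization counter."""
    return min(candidates, key=lambda ch: windows[ch].count)


w = UtilizationWindow(window=4)
for sent in (True, True, False, True):
    w.tick(sent)
print(w.count)  # 3
w.tick(False)   # the oldest one leaves the window
print(w.count)  # 2
```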

The minimal multiplexation (MM) selection function (the one proposed
in [8]) tries to minimize the number of messages being multiplexed onto a
physical link while maximizing adaptivity. This selection function will
select, from the set of free feasible output virtual channels, the one belonging to
the physical link having the highest number of free virtual channels. If two
physical links have the same number of free virtual channels, this function will
select the one with the higher number of free new virtual channels. As in the
previous selection functions, new virtual channels always have higher priority
than original virtual channels. To implement the minimal multiplexation selection
function, some hardware is required to perform the necessary comparisons.
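
The comparisons performed by the MM selection function can be sketched as
follows. This is an illustrative software model under our own assumptions:
channels are represented as `(link, vc, is_new)` tuples, the free-channel counts
are taken over all free virtual channels of a link, and the function name is
hypothetical.

```python
def minimal_multiplexing_select(feasible, free):
    """Pick a free feasible channel on the least multiplexed physical link.

    `feasible` is a list and `free` a set of (link, vc, is_new) tuples.
    """
    free_per_link = {}
    free_new_per_link = {}
    for link, _vc, is_new in free:
        free_per_link[link] = free_per_link.get(link, 0) + 1
        if is_new:
            free_new_per_link[link] = free_new_per_link.get(link, 0) + 1

    def score(channel):
        link = channel[0]
        # Primary key: free channels on the link; tie-break: free new channels.
        return (free_per_link[link], free_new_per_link.get(link, 0))

    candidates = [ch for ch in feasible if ch in free]
    new_candidates = [ch for ch in candidates if ch[2]]
    pool = new_candidates or candidates  # new channels take precedence
    return max(pool, key=score) if pool else None


free = {(0, 0, False), (0, 1, True), (1, 0, False), (1, 1, True), (1, 2, True)}
feasible = [(0, 1, True), (1, 1, True), (0, 0, False)]
print(minimal_multiplexing_select(feasible, free))             # (1, 1, True)
print(minimal_multiplexing_select(feasible, {(0, 0, False)}))  # (0, 0, False)
```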

The random (RAND) selection function randomly selects one channel from
the set of free feasible outgoing new virtual channels. If there is no free new
channel, it randomly chooses an original channel. In this case, a random
function must be implemented in hardware.
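
For comparison, the RAND rule reduces to a few lines in software. This is only a sketch with hypothetical names (`rand_select`, `rng`); a hardware implementation would use some pseudo-random generator that the text does not specify.

```python
import random


def rand_select(new_channels, original_channels, rng=random):
    """Choose randomly among free new channels, else among original ones."""
    if new_channels:
        return rng.choice(new_channels)
    if original_channels:
        return rng.choice(original_channels)
    return None  # nothing free: the message has to wait
```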

Finally, we have evaluated the influence of delaying the routing decision in
order to select a better channel when all of the feasible new virtual channels are
busy. Original channels provide less adaptivity and usually longer paths than new
channels. The use of non-minimal paths makes messages use more resources
than necessary, increasing the probability of blocking other messages and thus
decreasing performance. The selection functions presented above try to minimize
the use of original channels by assigning them the lowest priorities. However,
when no new virtual channel is free, messages must leave the switch through one
of the original channels. It is possible to reduce the use of original channels
even further by stopping the message at the current switch instead of allowing
it to be routed immediately through an original channel, even if one is
available. The message would then wait for a new channel to become free. However,
in order to avoid deadlock, messages must not wait indefinitely, so a time
threshold is needed. In this way, the message is allowed to leave the switch
using an original channel only if it has been waiting for longer than the
threshold.

We propose the use of two different thresholds. The first one is a simple
timeout. When a message arrives at a switch, a counter attached to the input
virtual channel is started. If the message has not been successfully routed
when the counter exceeds the threshold, then the message is allowed to leave the
switch through a suitable original channel. Note that the optimum value of this
threshold depends on message length.
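
The simple timeout can be sketched as a per-cycle routing decision. The function name and arguments below are hypothetical. The comparison is written as `>=` so that a threshold of zero reproduces the basic selection function, which the results section says is the case. Whether the hardware compares with `>` or `>=` is not specified in the text.

```python
def route_with_timeout(free_new, free_original, waited_cycles, threshold_cycles):
    """Return the channel to use in this cycle, or None to keep waiting.

    free_new / free_original: free feasible channels, already ordered by the
    underlying selection function (best candidate first).
    waited_cycles: value of the counter attached to the input virtual channel.
    """
    if free_new:
        return free_new[0]          # a new channel is always preferred
    if waited_cycles >= threshold_cycles and free_original:
        return free_original[0]     # timeout expired: fall back to an original channel
    return None                     # stay at the current switch for now
```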

The second one is based on monitoring the activity of the requested outgoing
channels. The message is allowed to leave the switch through an original channel
only when all of the feasible outgoing virtual channels for that message are
busy and none of them has transmitted a flit for a time that exceeds a
threshold. This idea is based on the deadlock detection mechanism proposed in [4].
The advantage of this mechanism with respect to the use of timeouts is that the
optimal value of the threshold should be less dependent on message length. Note
that both mechanisms are applicable to any selection function.

4 Performance Evaluation

In this section, we evaluate by simulation the performance of the proposed
selection functions. We will refer to them by the acronyms used when they were
presented. In addition, if thresholds are used, we will append a suffix to the
selection function identifier. For the first case (simple timeouts), we will use
the suffix Th_n, where n is the threshold measured in multiples of the message
length, in clock cycles. For the second case (deadlock detection), we will use
the suffix Det_n, where n is measured in clock cycles.
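
For the second case, where the threshold is a number of clock cycles of channel inactivity, the fallback condition can be sketched as follows. The function name and the `(busy, idle_cycles)` representation are assumptions made for illustration. Note that even with a threshold of 0 the channels must have been inactive for at least one cycle, so this does not collapse to the basic selection function.

```python
def may_use_original(feasible_channels, threshold_cycles):
    """Decide whether a waiting message may fall back to an original channel.

    feasible_channels: non-empty list of (busy, idle_cycles) pairs, one per
    feasible outgoing virtual channel requested by the message. idle_cycles is
    the number of cycles since that channel last transmitted a flit.
    """
    # Every requested channel must be busy and inactive for longer than the threshold.
    return all(busy and idle_cycles > threshold_cycles
               for busy, idle_cycles in feasible_channels)
```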



4.1 Network Model 

Network topology is irregular and has been generated randomly, imposing three 
restrictions: (i) there are 4 workstations connected to each switch, (ii) two neigh- 
boring switches are connected by a single link and (iii) all the switches have the 
same size (8-port switches). We have evaluated networks with a size ranging 
from 16 switches (64 workstations) to 64 switches (256 workstations). For the 
sake of brevity, we will only show results for 64 switches. Wormhole switching is 
used. 

Each switch has a routing control unit which applies the MA routing strategy.
A crossbar inside the switch allows multiple messages to traverse it
simultaneously. We assumed that it takes one clock cycle to compute the routing
algorithm or to transmit one flit across the crossbar. Link propagation delay
has been assumed to be 4 cycles. Links are pipelined. Data are injected into the
link at a rate of one flit per cycle.

The MA routing algorithm has been evaluated with two, three, and four
virtual channels per link. For the sake of brevity, we present our results only
for four virtual channels. For two virtual channels, differences among choices
are not significant, and for three and four virtual channels results are very
similar.

Message generation rate is constant and the same for all workstations. We
have evaluated the full range of traffic, from low load to saturation. Message
destination is randomly chosen among all the workstations in the network. For
message length, 16-flit and 64-flit messages were considered.

4.2 Simulation Results 

Figure 2 shows the average message latency versus traffic for the evaluated
selection functions. As can be seen, MM achieves lower latency than the rest of
the selection functions. Compared with the worst selection function, MM improves
latency by about 15% for short messages and 20% for long messages when the
network is near saturation. The performance of the SP selection function is very
close to that of MM, because both algorithms distribute the outgoing messages
among all the possible virtual channels. Hence, both selection functions cause
physical links to be more evenly multiplexed. Moreover, they try to minimize link
multiplexing, thus allowing messages to advance faster. As a consequence,
messages release resources faster, thus decreasing the probability of blocking
other messages. Therefore, the use of non-minimal paths is also decreased.

Fig. 2. Average message latency versus traffic (flits/cycle/node) for the
analyzed selection functions. Network size is 64 switches. Message length is
16 (a) and 64 (b) flits

Fig. 3. Comparison of different thresholds with MM. Average message latency
versus traffic (flits/cycle/node) for a network with 64 switches. Simple timeouts
are used. Message length is 16 flits in (a) and 64 flits in (b)

Let us analyze the use of time thresholds in SP and MM. Figures 3-a and 3-b
show some results for MM using simple timeouts for short and long messages,
respectively. A threshold equal to zero is equivalent to using the basic selection
function. As can be seen, using any threshold for short messages provides
similar improvements. For long messages, the performance improvement is higher.
Thresholds ranging from 3 to 6 obtain the best results. The ability of the
selection function to use the original channels only when necessary is what
improves performance. Figures 4-a and 4-b show the results when using the
deadlock detection mechanism to allow the use of original channels. In this case,
a threshold equal to 0 is not the same as the basic selection function. Although
the performance improvement is similar to the one obtained when using simple
timeouts, the best threshold is the same for both message lengths, which
simplifies network tuning.

Fig. 4. Comparison of different thresholds with SP. Average message latency
versus traffic (flits/cycle/node) for a network with 64 switches. The deadlock
detection based mechanism is used. Message length is 16 flits in (a) and 64 flits
in (b)

Fig. 5. Comparison of best results of MM and SP. Average message latency
versus traffic (flits/cycle/node) for a network with 64 switches. Message length
is 16 (a) and 64 (b) flits

In Figure 5 we compare the best combinations of thresholds and selection
functions in order to determine the selection function that achieves the highest
performance. For short messages (Figure 5-a), the best option is MMDet_4.
This selection function reduces latency by 20% and increases throughput by 9%
with respect to the basic MM function. Similar results are obtained for SPDet_8.
These results are better than the ones obtained by the timeout-based selection
functions. The reason is that the deadlock detection mechanism is more selective
than timeouts for detecting network congestion. Thus, the use of original
channels is more restricted than when the timeout-based selection functions are
used. For long messages (Figure 5-b), using a simple timeout and using the
deadlock detection mechanism obtain similar improvements with respect to the
basic functions. MMDet_4 and SPDet_8 present slightly better results than the
selection functions based on timeouts. MMDet_4 improves latency by about 30% and
throughput by about 9% with respect to MM, and message latency using SPDet_8
is about 40% lower than with the basic SP, while throughput also improves by 9%.

5 Conclusions 

In this paper we have evaluated the influence on network performance of the
selection function executed at each switch in order to select one of the output
virtual channels provided by the routing algorithm. To do so, we have compared
several selection functions. The results show variations in latency of about
20%, depending on the basic selection function implemented. In addition, we
have extended the two best selection functions, MM and SP, in order to use time
thresholds. Two alternatives, one based on simple timeouts and another based on
a deadlock detection mechanism, have been analyzed. Finally, we have compared
the best results of each alternative to identify the best selection functions.
Using a mechanism based on deadlock detection improves network performance
with respect to using a simple timeout, especially for short messages. In
addition, the former can be tuned independently of message length. Results
show that latency is reduced while throughput is increased by about
10% for both MM and SP.

Abstract. In previous papers we proposed the ITB mechanism to improve
the performance of up*/down* routing in irregular networks with
source routing. With this mechanism, both minimal routing and a better
use of network links are guaranteed, resulting in an overall network
performance improvement. In this paper, we show that the ITB mechanism
can be used with any source routing scheme in the COW environment.
In particular, we apply ITBs to the DFS and Smart routing algorithms,
which provide better routes than up*/down* routing. Results show that
ITB strongly improves DFS throughput (by 63%, for 64-switch networks) and
Smart throughput (by 23%, for 32-switch networks).



1 Introduction 

Clusters of workstations (COWs) are currently being considered as a cost- 
effective alternative for small-scale parallel computing. In these networks, to- 
pology is usually fixed by the location constraints of the computers, making it 
irregular. On the other hand, source routing is often used as an alternative to 
distributed routing, because non-routing switches are simpler and faster. 

Up*/down* [6] is one of the best known routing algorithms for irregular 
networks. It is based on an assignment of direction labels (“up” or “down”) 
to links. To eliminate deadlocks a route must traverse zero or more “up” links 
followed by zero or more “down” links. While up*/down* routing is simple, 
it concentrates traffic near the “root” switch and uses a large number of non- 
minimal paths. 
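
The up*/down* rule stated above (zero or more "up" links followed by zero or more "down" links) can be checked with a single pass over a route. This is a minimal sketch: the function name and the representation of a route as a list of direction labels are assumptions made for illustration.

```python
def is_valid_updown(directions):
    """Return True if a route obeys the up*/down* rule.

    directions: sequence of "up" / "down" labels, one per link traversed.
    """
    seen_down = False
    for d in directions:
        if d == "down":
            seen_down = True
        elif d == "up":
            if seen_down:
                return False  # an "up" link after a "down" link is forbidden
        else:
            raise ValueError("unknown direction label: %r" % (d,))
    return True
```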

Other routing algorithms like Smart [2] and DFS [5] achieve better
performance than up*/down*. Smart first computes all possible paths for every
source-destination pair, building the channel dependence graph (CDG). Then, it
uses an iterative process to remove dependencies in the CDG, taking into account
a heuristic cost function. Although Smart routing distributes traffic better than
other approaches, it has the drawback of a high computation overhead. DFS
computes a depth-first spanning tree with no cycles. Then, it adds the remaining
channels to provide minimal paths, breaking cycles by restricting routing.
A heuristic is also used to reduce routing restrictions.

These routing strategies remove cycles by restricting routing. As a
consequence, many of the allowed paths are not minimal, increasing both latency
and contention in the network. Also, forbidding some paths may result in an
unbalanced network traffic distribution, which leads to rapid saturation. In
this paper, we propose the use of a mechanism that removes channel dependences
without restricting routing. This mechanism was first proposed in [3] to
improve up*/down* routing, but it can be applied to any routing algorithm. In
this paper we apply it to improved routing schemes (Smart and DFS).

The rest of the paper is organized as follows. Section 2 summarizes how the 
mechanism works and its application to some optimized routing strategies. In 
Section 3, evaluation results for different networks and traffic load conditions 
are presented, analyzing the benefits of using our mechanism combined with 
previous routing proposals. Finally, in Section 4 some conclusions are drawn. 



2 Applying the ITB Mechanism to Remove Channel 
Dependences 

We first summarize the basic idea of the mechanism. The paths between
source-destination pairs are computed following any given rule and the
corresponding CDG is obtained. Then, the cycles in the CDG are broken by
splitting some paths into sub-paths. To do so, an intermediate host inside the
path is selected and used as an in-transit buffer (ITB); at this host, packets
are ejected from the network as if it were their destination. The mechanism is
cut-through. Therefore, packets are re-injected into the network as soon as
possible to reach their final destination. Notice that the dependences between
the input and output switch channels are completely removed because, in the case
of network contention, packets will be completely ejected from the network at
the intermediate host. The CDG is made acyclic by repeating this process until
no cycles are found. Notice that more than one intermediate host may be needed.
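
The cycle-breaking loop can be illustrated with a small sketch. It is not the authors' algorithm: the function names are hypothetical, channels are represented as opaque labels, and the sketch splits every sub-path that contributes the offending dependency, whereas the text only says that "some paths" are split. Choosing the actual host attached to the switch between the two channels is left out.

```python
def build_cdg(subpaths):
    """Channel dependence graph: a -> b if some sub-path uses b right after a."""
    deps = {}
    for sp in subpaths:
        for a, b in zip(sp, sp[1:]):
            deps.setdefault(a, set()).add(b)
            deps.setdefault(b, set())
    return deps


def find_cycle_edge(deps):
    """Return one dependency (a, b) that closes a cycle, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def visit(c):
        color[c] = GREY
        for nxt in deps[c]:
            if color[nxt] == GREY:
                return (c, nxt)          # back edge: closes a cycle
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        color[c] = BLACK
        return None

    for c in deps:
        if color[c] == WHITE:
            found = visit(c)
            if found:
                return found
    return None


def place_itbs(paths):
    """Split paths into sub-paths (one ITB per split) until the CDG is acyclic."""
    subpaths = [list(p) for p in paths]
    while True:
        edge = find_cycle_edge(build_cdg(subpaths))
        if edge is None:
            return subpaths
        new_subpaths = []
        for sp in subpaths:
            cut = next((i for i in range(len(sp) - 1)
                        if (sp[i], sp[i + 1]) == edge), None)
            if cut is None:
                new_subpaths.append(sp)
            else:
                # The packet is ejected after channel sp[cut] and re-injected later.
                new_subpaths.append(sp[:cut + 1])
                new_subpaths.append(sp[cut + 1:])
        subpaths = new_subpaths
```

Each split removes at least one dependency, so the loop terminates. The number of splits corresponds to the number of ITBs used in this simplified model.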

On the other hand, ejecting and re-injecting packets at some hosts also
improves performance by reducing network contention. Packets that are ejected
free the channels they have reserved, thus allowing other packets requiring
these channels to advance through the network (otherwise, they would become
blocked). Therefore, adding some extra ITBs at some hosts may help to improve
performance. Hence, the goal of the ITB mechanism is not only to provide minimal
paths by breaking some dependences but also to improve performance by reducing
network contention. However, ejecting and re-injecting packets at some
intermediate hosts also increases the latency of these packets and requires some
additional resources in both the network (links) and the network interface cards
(memory pools and DMA engines).

If the rules used to build the paths between source-destination pairs lead to
an unbalanced traffic distribution, then adding more ITBs than strictly needed
will help. This is the case for up*/down*, because this routing tends to
saturate the area near the root switch. Thus, there is a trade-off between using
the minimum number of ITBs that guarantees deadlock-free minimal routing
and using more than these to improve network throughput. Therefore, when
we apply the ITB mechanism to up*/down*, we will use these two approaches.
In the first case, we will place the minimum number of ITBs that guarantees
deadlock-free minimal routing. Thus, given a source-destination pair, we will
compute all minimal paths. If there is a valid minimal up*/down* path, it will
be chosen. Otherwise, a minimal path with ITBs will be used. In the second
approach, we will use more ITBs than strictly needed to guarantee deadlock-free
minimal routing. In particular, we will randomly choose one minimal path. If the
selected path complies with the up*/down* rule, it is used without modification.
Otherwise, ITBs are inserted even if there exist valid minimal up*/down* paths
between the same source-destination pair.
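
The two approaches can be contrasted with a short sketch. All names are hypothetical, and each minimal path is reduced to its list of link direction labels. Two points are assumptions that the text does not spell out: that an ITB is placed wherever a "down" link is followed by an "up" link, and that the first approach picks the path needing the fewest ITBs when no valid up*/down* path exists.

```python
import random


def itb_switches(directions):
    """Hop indices after which a packet must be ejected into an ITB.

    directions: "up" / "down" labels of the links along one minimal path.
    An ITB is assumed wherever a "down" link is followed by an "up" link.
    """
    return [i for i in range(len(directions) - 1)
            if directions[i] == "down" and directions[i + 1] == "up"]


def choose_min_itb(minimal_paths):
    """First approach: fewest ITBs (zero if a valid up*/down* path exists)."""
    return min(minimal_paths, key=lambda p: len(itb_switches(p)))


def choose_random(minimal_paths, rng):
    """Second approach: any minimal path, chosen at random, ITBs added as needed."""
    return rng.choice(minimal_paths)
```

For example, with `paths = [["up", "down", "up"], ["up", "up", "down"]]`, `choose_min_itb(paths)` returns the second path, while `choose_random(paths, random.Random())` may return the first one, which requires one ITB.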

In the case of DFS, we will use ITBs in the same way as in the second
approach used for up*/down*, but verifying whether the paths comply with the DFS
rule. However, for Smart routing, we will use a different approach. We first
compute the paths between source-destination pairs that better balance network
traffic. Notice that the obtained routes are not the same as those Smart
computes, because it computes both balanced and deadlock-free routes whereas we
compute only balanced routes. For this reason, we will refer to these routes as
“balanced” rather than “smart”. Then, we compute the CDG and place ITBs to
convert it into an acyclic one. On the other hand, since computing balanced
routes alone is easier than computing both balanced and deadlock-free routes,
the computational cost of the resulting routing algorithm is lower than that of
Smart routing.

3 Performance Evaluation 

3.1 Network Model and Network Load 

The network topologies we consider are irregular and have been generated ran- 
domly, imposing three restrictions: (i) all the switches have the same size (8 
ports), (ii) there are 4 hosts connected to each switch and (iii) two neighboring 
switches are connected by a single link. We have analyzed networks with 16, 32, 
and 64 switches (64, 128, and 256 hosts, respectively). 

Links, switches, and interface cards are modeled based on the Myrinet
network [1]. Concerning links, we assume Myrinet short LAN cables [4] (10 meters
long, 160 MB/s, 4.92 ns/m). Flits are one byte wide. Physical links are one flit
wide. Transmission of data across channels is pipelined [7] with a rate of one flit
every 6.25 ns and a maximum of 8 flits on the link at a given time. A hardware
“stop and go” flow control protocol [1] is used to prevent packet loss. The slack
buffer size in Myrinet is fixed at 80 bytes. Stop and go marks are fixed at 56
bytes and 40 bytes, respectively.
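
As a quick consistency check on these link parameters, the flit time and the number of flits in flight on a 10-meter cable work out as follows. This assumes that MB denotes 10^6 bytes, which the text does not state. The symbols t_flit and t_prop are introduced here only for the calculation.

```latex
\begin{align*}
t_{\text{flit}} &= \frac{1\ \text{byte}}{160 \times 10^{6}\ \text{bytes/s}} = 6.25\ \text{ns} \\
t_{\text{prop}} &= 10\ \text{m} \times 4.92\ \text{ns/m} = 49.2\ \text{ns} \\
\left\lceil \frac{t_{\text{prop}}}{t_{\text{flit}}} \right\rceil &= \lceil 7.87 \rceil = 8\ \text{flits}
\end{align*}
```

Both results agree with the stated rate of one flit every 6.25 ns and the maximum of 8 flits on the link.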

Each switch has a simple routing control unit that removes the first flit of 
the header and uses it to select the output link. The first flit latency is 150 ns 
through the switch. After that, the switch is able to transfer flits at the link 
rate. Each output port can process only one packet header at a time. A crossbar 
inside the switch allows multiple packets to traverse it simultaneously. 

Each Myrinet network interface card has a routing table with one entry for 
every possible destination of messages. The tables are filled according to the 
routing scheme used. 

When ITBs are used, the incoming packet must be recognized as in-transit
and the transmission DMA must be re-programmed. We have used a
delay of 275 ns (44 bytes received) to detect an in-transit packet, and 200 ns (32
additional bytes received) to program the DMA to re-inject the packet. These
timings have been measured on a real Myrinet network. Also, the total capacity of
the in-transit buffers has been set to 512 KB at each interface card.
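
The two delays correspond to byte times at the link rate of one byte every 6.25 ns. The following short calculation makes the correspondence explicit. The combined figure of 475 ns per ITB is a derived value, not one quoted in the text.

```latex
\begin{align*}
44\ \text{bytes} \times 6.25\ \text{ns/byte} &= 275\ \text{ns} \quad \text{(detect in-transit packet)} \\
32\ \text{bytes} \times 6.25\ \text{ns/byte} &= 200\ \text{ns} \quad \text{(program the DMA)} \\
275\ \text{ns} + 200\ \text{ns} &= 475\ \text{ns} \quad \text{(76 byte times per ITB)}
\end{align*}
```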

In order to evaluate different workloads, we use different message destination
distributions to generate network traffic: Uniform (the destination is chosen
randomly with the same probability for all the hosts), Bit-reversal (the
destination is computed by reversing the bits of the source host id), Local
(destinations are, at most, 5 switches away from the source host, and are
randomly computed), Hot-spot (a percentage of traffic (20%, 15%, and 5% for 16,
32, and 64-switch networks, respectively) is sent to one randomly chosen host
and the rest of the traffic randomly among all hosts), and a Combined
distribution, which mixes the previous ones. In the latter case, each host will
generate messages using each distribution with the same probability.
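
Three of these destination distributions can be sketched without any topology information. The function names are hypothetical. The sketch assumes that hosts are numbered from 0 and that the number of hosts is a power of two. It does not exclude the source host as a destination, since the text does not say how that case is handled. The Local distribution is omitted because it depends on switch distances.

```python
import random


def uniform_dest(n_hosts, rng):
    """Uniform: every host is equally likely."""
    return rng.randrange(n_hosts)


def bit_reversal_dest(src, n_hosts):
    """Bit-reversal: reverse the bits of the source host id."""
    bits = n_hosts.bit_length() - 1          # n_hosts assumed to be a power of two
    return int(format(src, "0{}b".format(bits))[::-1], 2)


def hot_spot_dest(n_hosts, rng, hot_host, hot_fraction):
    """Hot-spot: a fraction of the traffic goes to one fixed host."""
    if rng.random() < hot_fraction:
        return hot_host
    return rng.randrange(n_hosts)
```

For a 64-switch network (256 hosts), `hot_fraction` would be 0.05 according to the percentages given above.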

Packet generation rate is constant and the same for all the hosts. Although
we use different message sizes (32, 512, and 1K bytes), for the sake of brevity
results will be shown only for 512-byte messages.

3.2 Simulation Results 

First, we analyze the behavior of the routing algorithms without using in-transit
buffers. Results for the up*/down*, DFS, and Smart routing algorithms will be
referred to as UD, DFS, and SMART, respectively. Then, we evaluate the use
of in-transit buffers over up*/down* and DFS routing. For up*/down* routing,
we analyze the two approaches mentioned above: using the minimum number of
ITBs needed to guarantee deadlock-free minimal routing (UD_MITB), and using
more ITBs (UD_ITB). For DFS routing, we only use the second approach, which
will be referred to as DFS_ITB. Finally, we evaluate the use of in-transit buffers
over the balanced but deadlocking routes supplied by the Smart routing algorithm.
This routing will be referred to as B_ITB (B from “balanced”).

For each network size analyzed, we show the increase in throughput when
using the in-transit buffer mechanism with respect to the original routing
algorithms. Minimum, maximum, and average results for 10 random topologies are
shown. In addition, we will plot the average message latency versus the accepted
traffic for selected topologies.

Fig. 1. Average message latency vs. accepted traffic. Message length is 512 bytes.
Uniform distribution. Network size is (a) 16 switches, (b) 32 switches, and (c)
64 switches



Routing Algorithms without ITBs. Figure 1 shows the results for the uniform
distribution of message destinations for selected topologies of 16, 32, and 64
switches, respectively. SMART routing is not shown for the 64-switch network
due to its high computation time.

As expected, the best routing algorithm is SMART. It achieves the highest
network throughput for all the topologies we could evaluate. In particular,
it increases throughput over UD and DFS routing by factors of up to 1.77 and 1.28,
respectively.

The performance improvement achieved by SMART is due to its better traffic
balancing. Figure 3.a shows the utilization of links connecting switches for the
32-switch network. Links are sorted by utilization. Traffic is 0.03
flits/ns/switch. For this traffic value, UD routing is reaching saturation. When
using UD routing, half the links are poorly used (52% of links have a link
utilization lower than 10%) and a few links are highly used (only 11% of links
have a link utilization higher than 30%), some of them being over-used (3 links
have a link utilization higher than 50%). Traffic is clearly unbalanced among
the links. DFS routing reduces this imbalance and has 31% of links with link
utilization lower than 10% and 9% of links with link utilization higher than
30%. The best traffic balancing is achieved by SMART routing. For the same
traffic value, links are highly balanced, with link utilization ranging from
7.76% to 20.26% (76% of links have a link utilization between 15% and 20%). As
traffic is better balanced, more traffic can be handled by SMART routing and,
therefore, higher throughput is achieved.



Routing Algorithms with ITBs. Figure 2 shows the performance results obtained
by the UD_MITB, UD_ITB, DFS_ITB, and B_ITB routing algorithms for the uniform
distribution of message destinations for selected 16, 32, and 64-switch
networks, respectively. Table 1 shows the average results for 30 different
topologies. For 64-switch networks, Smart routes were not available due to their
high computation time.

Fig. 2. Average message latency vs. accepted traffic. UD, DFS, SMART,
UD_MITB, UD_ITB, DFS_ITB, and B_ITB routing. Message length is 512 bytes.
Uniform distribution. Network size is (a) 16 switches, (b) 32 switches, and (c)
64 switches

Table 1. Factor of throughput increase when using in-transit buffers on the UD,
DFS, and SMART routing. Uniform distribution. Message size is 512 bytes

        UD_MITB vs UD         UD_ITB vs UD        DFS_ITB vs DFS       B_ITB vs SMART
Sw      Min    Max    Avg    Min    Max    Avg    Min    Max    Avg    Min    Max    Avg
16     1.00   1.29   1.13   1.00   1.57   1.29   1.01   1.20   1.12   1.00   1.16   1.07
32     1.16   1.72   1.46   1.50   2.14   1.88   1.25   1.56   1.41   1.11   1.33   1.23
64     1.60   2.25   1.91   2.20   3.00   2.57   1.50   1.85   1.63    N/A    N/A    N/A

Let us first comment on the influence of in-transit buffers on up*/down* and
DFS. As can be seen, the ITB mechanism always improves network throughput
over both original routing algorithms. Moreover, as network size increases, more
benefits are obtained. In particular, UD_MITB improves over UD by factors of
1.12, 1.50, and 2.00 for 16, 32, and 64-switch networks, respectively. Moreover,
when more ITBs are used, more benefits are obtained. In particular, UD_ITB
improves over UD by factors of 1.22, 2.14, and 2.75 for the 16, 32, and 64-switch
networks, respectively. Concerning DFS, DFS_ITB routing improves network
throughput over DFS by factors of 1.10, 1.39, and 1.54 for the same network
sizes.

Notice that UD_ITB and DFS_ITB achieve roughly the same network
throughput. These routing algorithms use the same minimal paths and the main
difference between them is where the in-transit buffers are allocated and how
many in-transit buffers are needed. Also, DFS_ITB routing exhibits lower
average latency than UD_ITB. This is because DFS routing is less restrictive
than UD routing and, therefore, DFS_ITB needs fewer ITBs on average than
UD_ITB. When using DFS_ITB routing in the 64-switch network, messages use
0.3 ITBs on average, while the average number of ITBs per message is 0.55 in
UD_ITB. This also explains the higher throughput achieved by UD_ITB, since
messages using ITBs are removed from the network, thus reducing congestion.

Fig. 3. Link utilization and link blocked time. Network size is 32 switches.
Message size is 512 bytes. Uniform distribution. (a) Link utilization for original
routings. (b) Link utilization for routings with ITBs. (c) Blocked time for SMART
and B_ITB routing. Traffic is (a, b) 0.03 and (c) 0.05 flits/ns/switch

On the other hand, as network size increases, the throughput improvement
over routing algorithms that do not use ITBs also increases. UD and DFS routing
are computed from a spanning tree and one of the main drawbacks of such an
approach is that, as network size increases, a smaller percentage of minimal
paths can be used. For the 16-switch network, 89% of the routes computed by UD
are minimal. However, for 32 and 64-switch networks, the percentage of minimal
routes goes down to 71% and 61%, respectively. When DFS routing is used,
something similar occurs: 94%, 81%, and 70% of the routes are minimal for
the 16, 32, and 64-switch networks, respectively. When using in-transit buffers,
all the computed routes are minimal.

Another drawback of routing algorithms computed from spanning trees is
unbalanced traffic. As network size increases, routing algorithms tend to
overuse some links (links near the root switch) and this leads to unbalanced
traffic. As in-transit buffers allow the use of alternative routes, network
traffic is not forced to pass through the root switch (in the spanning tree),
thus improving network performance. Figure 3.b shows the link utilization for
UD_MITB, UD_ITB, DFS_ITB, and B_ITB routing for the 32-switch network. Network
traffic is 0.03 flits/ns/switch (where UD routing saturates). We observe that
UD_MITB routing achieves a better traffic balancing than UD (see Figure 3.a).
Only 33% of links have a link utilization lower than 10% and only 10% of links
are used more than 30% of the time. However, as this algorithm uses ITBs only
when necessary to ensure deadlock-free minimal routing, a high percentage of
routes are still valid minimal up*/down* paths and, therefore, part of the
traffic is still forced to cross the root switch. UD_ITB and DFS_ITB improve on
the traffic balance of UD_MITB. With UD_ITB routing, all links have a
utilization lower than 30% and only 20% of links are used less than 10% of the
time. DFS_ITB routing shows roughly the same traffic balance.

Let us now analyze in-transit buffers with Smart routing. Smart routing is not
based on spanning trees. Moreover, its main goal is to balance network traffic.
In fact, we have already seen the good traffic balancing achieved by this routing
algorithm (see Figure 3.a). Therefore, it seems that in-transit buffers will have
little to offer to Smart routing. However, we observe in Table 1 that for Smart
routing, the in-transit buffer mechanism also increases network throughput
(except for one 16-switch network where it obtains the same network throughput).
For a 32-switch network, B_ITB routing increases network throughput by a
factor of 1.33.

In order to analyze the reasons for this improvement, Figure 3.b shows traffic
balancing among all the links for B_ITB routing at 0.03 flits/ns/switch. As can
be seen, it is very similar to the one obtained by Smart (see Figure 3.a). The
reason is that SMART routing is quite good at balancing traffic among all the
links and, therefore, the in-transit buffer mechanism does not improve network
throughput by balancing traffic even more.

To fully understand the better performance achieved by B_ITB routing, we
now focus on network contention. For this reason, we plot the link blocked time
for both routing algorithms. Blocked time is the percentage of time that the
link stops transmission due to flow control. This is a direct measure of network
contention. Figure 3.c shows the link blocked time for a 32-switch network when
using SMART and B_ITB routing. Traffic is near 0.05 flits/ns/switch. We observe
that Smart routing has some links blocked more than 10% of the time and some
particular links blocked more than 20% of the time. On the other hand, when
using in-transit buffers, blocked time is kept lower than 5% for all the links at
the same traffic point.

In order to analyze the overhead introduced by ITBs, Table 2 shows the
latency penalty introduced by in-transit buffers for very low traffic (the worst
case). We show results for 512 and 32-byte messages. For 512-byte messages we
observe that, on average, the in-transit buffer mechanism slightly increases
average message latency. This increase is never higher than 5%. The latency
increase is only noticeable for short messages (32 bytes). In this case, the
maximum latency increase ranges from 16.66% to 22.09% for UD_ITB. The
explanation is simple. The ITBs only increase the latency components that depend
on the number of hops. Therefore, short messages suffer a higher penalty in
latency. Additionally, the latency penalty depends on the number of ITBs needed
to guarantee deadlock freedom. This is also shown in Table 2, where the average
latency penalty is lower when using ITBs with Smart, DFS, or the minimum number
of ITBs with UD (UD_MITB). Finally, the latency overhead incurred by ITBs is
partially offset by the on-average shorter paths allowed by the mechanism.
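
A rough calculation suggests why short messages are more sensitive to ITBs. It uses only the network model parameters: one byte every 6.25 ns, and 275 ns + 200 ns = 475 ns to detect and re-inject an in-transit packet. The serialization times T_512 and T_32 are symbols introduced here. Comparing the per-ITB delay against serialization time alone is a simplifying assumption.

```latex
\begin{align*}
T_{512} &= 512 \times 6.25\ \text{ns} = 3200\ \text{ns}, & \frac{475\ \text{ns}}{3200\ \text{ns}} &\approx 0.15 \\
T_{32} &= 32 \times 6.25\ \text{ns} = 200\ \text{ns}, & \frac{475\ \text{ns}}{200\ \text{ns}} &\approx 2.4
\end{align*}
```

These ratios are not the measured increases reported in Table 2. The measurements also include switch and link delays and are averaged over messages that use no ITB at all. The ratios only illustrate that a fixed per-ITB cost weighs far more against a 32-byte message than against a 512-byte one.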

Table 3 shows the factor of throughput increase for the hot-spot, bit-reversal,
local, and combined traffic patterns. We observe that the in-transit buffer
mechanism always increases, on average, the network throughput of UD and DFS
routing. In particular, when the combined traffic pattern is used, UD_ITB
improves over UD by factors of 1.26, 1.65, and 2.31 for 16, 32, and 64-switch
networks, respectively. Also, DFS_ITB improves over DFS by factors of 1.14, 1.35,
and 1.56 for 16, 32, and 64-switch networks, respectively. Finally, B_ITB
increases, on average, network throughput by a factor of 1.14 for 32-switch
networks.

We conclude that, by using in-transit buffers on all the routing schemes 
analyzed, network throughput is increased. As network size increases, higher 
Table 2. Percentage of message latency increase for very low traffic when using 
in-transit buffers on UD, DFS, and SMART routing. Uniform distribution 

Msg.          UD_MITB vs UD          UD_ITB vs UD          DFS_ITB vs DFS        B_ITB vs SMART
size  Sw      Min    Max    Avg      Min    Max    Avg      Min    Max    Avg      Min    Max    Avg
512   16    -0.24   0.76   0.20     1.01   2.63   1.69     0.22   0.74   0.57    -0.20   2.22   1.64
512   32     0.26   2.42   1.24     2.31   4.20   3.33     0.90   1.08   0.93     1.78   2.95   2.34
512   64    -3.85  -1.03  -2.22    -1.23   1.46   0.31     0.67   1.27   1.02      N/A    N/A    N/A
32    16     0.80   3.41   2.29     8.52  16.66  10.52     1.56   5.50   3.49    -0.35  10.18   7.77
32    32     2.33   7.57   5.32    13.16  18.07  16.44     6.09   7.28   6.59     9.26  13.28  11.00
32    64     1.40   5.97   3.64    11.69  22.09  16.87     6.44   8.56   7.64      N/A    N/A    N/A


Table 3. Factor of throughput increase when using in-transit buffers on the UD, 
DFS, and SMART routing for different traffic patterns. Message size is 512 bytes 

                  UD_MITB vs UD          UD_ITB vs UD          DFS_ITB vs DFS        B_ITB vs SMART
Distrib.  Sw      Min    Max    Avg      Min    Max    Avg      Min    Max    Avg      Min    Max    Avg
Hot-spot  16     0.99   1.17   1.04     0.99   1.21   1.10     1.00   1.17   1.05     0.85   1.17   0.96
Hot-spot  32     1.00   1.40   1.18     1.00   1.39   1.18     0.98   1.17   1.03     1.00   1.00   1.00
Hot-spot  64     1.60   2.08   1.71     1.66   2.57   2.03     1.21   1.49   1.35      N/A    N/A    N/A
Bit-rev.  16     0.94   1.44   1.16     0.87   1.81   1.17     0.79   1.27   1.03     0.73   1.13   0.93
Bit-rev.  32     1.12   2.00   1.59     1.56   2.57   1.87     1.20   1.99   1.51     0.99   1.45   1.21
Bit-rev.  64     1.74   2.99   2.05     2.21   3.50   2.76     1.46   2.20   1.78      N/A    N/A    N/A
Local     16     0.97   1.26   1.08     1.02   1.56   1.24     1.00   1.30   1.17     1.00   1.17   1.10
Local     32     1.00   1.40   1.16     1.12   1.60   1.44     1.15   1.45   1.29     1.10   1.29   1.17
Local     64     1.00   1.20   1.07     1.40   1.57   1.49     1.13   1.33   1.24      N/A    N/A    N/A
Combined  16     1.00   1.45   1.15     1.00   1.56   1.26     0.98   1.28   1.14     1.00   1.17   1.06
Combined  32     1.12   1.57   1.31     1.31   1.86   1.65     1.20   1.50   1.35     1.04   1.27   1.14
Combined  64     1.48   2.00   1.74     1.82   2.65   2.31     1.43   1.80   1.56      N/A    N/A    N/A


improvements are obtained. In-transit buffers avoid congestion near the root 
switch (in the tree-based schemes), always provide deadlock-free minimal paths 
and balance network traffic. On the other hand, average message latency is 
slightly increased, but this increase is only noticeable for short messages and 
small networks. 



4 Conclusions 

In previous papers, we proposed the ITB mechanism to improve network per- 
formance in networks with source routing and up*/down* routing. Although 
the mechanism was primarily intended for breaking cyclic dependences between 
channels that may result in a deadlock, we have found that it also serves as 
a mechanism to reduce network contention and better balance network traffic. 
Moreover, it can be applied to any source routing algorithm. 

In this paper we apply the ITB mechanism to up*/down*, DFS, and Smart 
routing schemes, analyzing its behavior in detail using up to 30 randomly generated 
topologies, different traffic patterns (uniform, bit-reversal, local, hot-spot, 
and combined), and network sizes (16, 32, and 64 switches). Network design 
parameters were obtained from a real Myrinet network. 

Results show that the in-transit buffer mechanism improves network performance 
for all the studied source routing algorithms. Up*/down* routing is 
significantly improved due to the many routing restrictions that it imposes and 
the unbalanced traffic nature of the spanning trees. Better source routing algo- 
rithms, like DFS and Smart, are also improved by the ITB mechanism. Finally, 
we have observed that as more ITBs are added to the network, throughput in- 
creases but the latency also increases due to the small penalty of using in-transit 
buffers. Therefore, there is a trade-off between network throughput and message 
latency. Thus, network designers have to decide on the appropriate number of 
ITBs depending on the application requirements. 

As for future work, we plan to implement the proposed mechanism on an 
actual Myrinet network in order to confirm the obtained simulation results. 
Also, we are working on techniques that reduce ITB overhead. 
Abstract. Caches do not grow in size at the speed of main memory 
or raw processor performance. Therefore, optimal use of the limited 
cache resources is of paramount importance to obtain good system 
performance. Instead of a recency-based replacement policy (such as 
LRU), we can also make use of a locality-based policy, based on the 
temporal reuse of data. 

These replacement policies have usually been constructed to operate in 
a cache with multiple modules, some of them dedicated to data showing 
high temporal reuse, and some of them dedicated to data showing low 
temporal reuse. 

In this paper, we show how locality-based replacement policies can be 
adapted to operate in set-associative and skewed-associative [8] caches. 
In order to understand the benefits of locality-based replacement policies, 
they are compared to recency-based replacement policies, something that 
has not been done before. 



1 Introduction 

Trends in microprocessor development indicate that microprocessors gain in 
speed much faster than main memory. This discrepancy is called the memory 
gap. The memory gap can be hidden using multiple levels of cache memories. 
But even then, the delays introduced by the caches and main memory are be- 
coming so large, that the memory hierarchy remains a bottleneck in processor 
performance. 

An important part of the cache design is the replacement policy, which de- 
cides what data may be evicted from the cache. A recent approach to better 
replacement policies is using locality properties of the memory reference stream. 
In studies of such replacement policies, cache organisations consisting of multiple 
cache modules are used. Each module is a conventional cache and is dedicated 
to data with a specific locality type. A typical organisation is a direct mapped 
cache dedicated to data exhibiting temporal locality combined with a smaller 
and fully associative cache for data with non-temporal or highly spatial locality. 

Such a cache organisation poses serious design problems. Since data can be 
found in either module, a multiplexer is needed to select the data from one of the 
modules, increasing the cache lookup time. Furthermore, a direct mapped cache 
has an inherently shorter access time than a fully associative cache. It is not 
always possible to find two modules with the same access time. This results in 
an unbalanced design with one module in the critical path. Because of these diffi- 
culties, we propose to use locality-sensitive replacement policies in simple cache 
organisations, including the set-associative and the skewed-associative cache. 
However, applying locality-sensitivity to set-associative caches is not straight- 
forward, because the operation of these replacement policies is closely interwo- 
ven with the organisation of a multi-module cache. The locality type of a block 
is derived from the module it is stored in. Therefore, we propose to label the 
blocks with their locality type, so that the replacement policy can make use of 
this information. 

This paper is organised as follows. In section 2, we describe the various cache 
organisations. Section 3 describes replacement policies and their extension to 
set-associative and skewed-associative caches. Section 4 describes the 
experimental setup and section 5 discusses the simulation results. Section 6 
discusses related work and section 7 summarises the main conclusions. 

2 Cache Organisations 

The most widespread cache organisation is that of a set-associative cache. In 
a set-associative cache, memory blocks are mapped to cache sets by extracting 
bits from the block number (Figure 1(a)). An n-way set-associative cache can 
contain n blocks from the same set. If n is 1, the cache is called direct mapped. 
When there is only one set in the cache, the cache is called fully associative. 
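
As a minimal sketch of the mapping just described, the following code derives the set index from the block number. The default sizes (32-byte blocks, 128 sets, i.e. an 8 kB two-way cache) are taken from the experimental setup later in the paper; the function names are illustrative assumptions.

```python
def set_index(address, block_size=32, num_sets=128):
    """Map a byte address to a cache set using the low bits of the block number."""
    block_number = address // block_size
    # For a power-of-two num_sets this modulo equals extracting the low bits.
    return block_number % num_sets


def organisation(num_blocks, ways):
    """Name the organisation of a cache with `num_blocks` frames and `ways` ways."""
    num_sets = num_blocks // ways
    if ways == 1:
        return "direct mapped"
    if num_sets == 1:
        return "fully associative"
    return f"{ways}-way set-associative"


if __name__ == "__main__":
    print(set_index(0x1234))
    print(organisation(256, 2))
```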
Fig. 1. Three cache organisations 



In a multi-module cache, multiple cache modules operate in parallel (Fig- 
ure 1(b)). Each cache module can be thought of as a conventional cache, pos- 
sibly having different associativity, block size, etc. A memory request is sent to 
all cache modules simultaneously, so extra combining logic is needed to obtain 
the result from the correct module. If data can only be cached in one module, 
then the combining logic consists of a multiplexer which selects the data from 
the module with a cache hit. 
We study multi-module caches with two modules having the same block size. 
In the remainder of this paper, we call the cache modules A and B. The sets in 
module A and module B are called A-sets and B-sets, respectively. 

A skewed-associative cache is a multi-bank cache. Each bank is indexed by 
a different set index function [8]. Furthermore, the index functions are designed 
in such a way that blocks are dispersed over the cache. When two blocks map 
to the same frame in one bank, then they do not necessarily map to the same 
frames in the other banks. An n-way skewed-associative cache can be modelled 
as a multi-module cache, where each module is direct mapped and corresponds 
to a bank of the skewed-associative cache. 
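
The dispersal property can be sketched with two index functions. These are not the functions of [8] or [10]; they are hypothetical XOR-based stand-ins chosen only to show that two blocks colliding in one bank need not collide in the other.

```python
NUM_FRAMES = 128  # frames per bank; an assumed power of two


def bank0_index(block_number):
    """Bank 0: conventional indexing with the low bits of the block number."""
    return block_number % NUM_FRAMES


def bank1_index(block_number):
    """Bank 1: illustrative skewing, XOR of the low bits with the next bit field."""
    low = block_number % NUM_FRAMES
    high = (block_number // NUM_FRAMES) % NUM_FRAMES
    return low ^ high


if __name__ == "__main__":
    a, b = 5, 5 + NUM_FRAMES   # same frame in bank 0
    print(bank0_index(a), bank0_index(b))
    print(bank1_index(a), bank1_index(b))
```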

3 Replacement Policies 

In this section we describe the recency-based replacement policies and the 
locality-based replacement policies. 



3.1 Recency-Based Replacement Policies 

The least recently used (LRU) policy is commonly used in set-associative caches. 
Since the complexity of the LRU algorithm scales as the square of the number 
of blocks involved [11], it would be impractical to implement it for multi-module 
caches. Other recency-based algorithms are thus needed. We discuss the not 
recently used and the enhanced not recently used policies, which have been con- 
structed for skewed-associative caches [8,9,10]. 

In the not recently used policy (NRU) a one bit tag is associated with every 
block in the cache. The tag bit is asserted every time the block is accessed and it 
signals that the block is young. The NRU policy also requires a (global) counter, 
which keeps track of the number of young blocks in the cache. When the counter 
reaches a certain threshold,^ all blocks in the cache have their tag bit reset and 
the counter is reset as well. The NRU policy selects its victim randomly among 
all old blocks in the cache. If there are no old blocks, it selects a block randomly 
among the young blocks. 
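
The NRU bookkeeping can be sketched as follows. The class and method names are illustrative assumptions, as is the `candidates` argument (the frames a missing block may occupy); the default threshold of half the cache size follows the value attributed to [9].

```python
import random


class NRUCache:
    """Not-recently-used bookkeeping for a cache with `num_blocks` frames."""

    def __init__(self, num_blocks, threshold=None, rng=None):
        self.young = [False] * num_blocks   # one tag bit per block
        self.young_count = 0                # the global counter of young blocks
        self.threshold = threshold if threshold is not None else num_blocks // 2
        self.rng = rng or random.Random(0)

    def touch(self, frame):
        """Record an access: assert the tag bit, reset all bits at the threshold."""
        if not self.young[frame]:
            self.young[frame] = True
            self.young_count += 1
        if self.young_count >= self.threshold:
            self.young = [False] * len(self.young)
            self.young_count = 0

    def victim(self, candidates):
        """Pick a random old frame among `candidates`, else a random young one."""
        old = [f for f in candidates if not self.young[f]]
        return self.rng.choice(old if old else list(candidates))


if __name__ == "__main__":
    cache = NRUCache(num_blocks=8)
    cache.touch(3)
    print(cache.victim([3, 4]))
```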

The enhanced not recently used policy (ENRU) is an improvement of the 
NRU policy. It uses two tag bits for each cache block and divides the cache 
blocks into three categories: very young, young and old. The ENRU policy selects 
a victim block at random, first among the old blocks, then the young blocks and, 
if necessary, among the very young blocks. 
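
A minimal sketch of ENRU victim selection is given below. Only the search order over the three categories comes from the text; how the two tag bits are updated on accesses is not described here, so the sketch takes the current ages as input. All names are illustrative.

```python
import random

OLD, YOUNG, VERY_YOUNG = 0, 1, 2   # three categories, encodable in two tag bits


def enru_victim(ages, candidates, rng=random):
    """Choose randomly within the oldest non-empty age category.

    ages: mapping from frame number to OLD, YOUNG or VERY_YOUNG.
    candidates: frames that may hold the incoming block.
    """
    for category in (OLD, YOUNG, VERY_YOUNG):
        pool = [f for f in candidates if ages[f] == category]
        if pool:
            return rng.choice(pool)
    raise ValueError("no candidate frames")


if __name__ == "__main__":
    print(enru_victim({0: VERY_YOUNG, 1: YOUNG}, [0, 1]))
```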

3.2 Locality-Based Replacement Policies 

Several locality-sensitive replacement policies have been proposed in the literature. 
We focus on the non-temporal streaming cache and the allocation by conflict 
policy. In these cache organisations, each module is dedicated to data exposing a 
specific type of locality. Hence, the replacement policy can be decomposed into 
two consecutive steps: (1) determining the locality type of the data and thus the 
module and (2) selecting a victim from the module, e.g., using LRU. 

^ It is reported that a good threshold is half the size of the cache, expressed as a 
number of blocks [9]. 

The non-temporal streaming (NTS) cache [5] consists of three units: a temporal 
module A, a non-temporal module B and a locality detection unit. When 
a block is placed in the cache, it is placed in either module A or module B, de- 
pending on the locality properties reported by the locality detection unit. Blocks 
exposing mainly temporal locality are placed in module A, the other blocks in 
module B. 

A block exposes temporal locality when at least one word in the block is 
used at least twice between loading the block and evicting it from the cache. 
Each block in the cache has one locality bit associated with it and each word in 
the cache has a reference bit. When a block is loaded into the cache, the block’s 
locality bit is set to zero, indicating non-temporal use, and the reference bit of 
the requested word is set to one. If the reference bit of the requested word is one 
on a cache hit, then the locality bit is also set to one, indicating temporal use. 
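
The detection rule above can be sketched per cache block as follows. The class name and fields are illustrative assumptions; the bit semantics follow the paragraph.

```python
class NTSBlock:
    """Locality bookkeeping for one cached block in an NTS-style cache."""

    def __init__(self, words_per_block, requested_word):
        self.temporal = False                         # locality bit, zero on load
        self.referenced = [False] * words_per_block   # one reference bit per word
        self.referenced[requested_word] = True        # the word that caused the load

    def hit(self, word):
        """Update the bits on a cache hit to `word`."""
        if self.referenced[word]:
            self.temporal = True       # the same word has now been used twice
        self.referenced[word] = True


if __name__ == "__main__":
    block = NTSBlock(words_per_block=8, requested_word=2)
    block.hit(3)
    print(block.temporal)
    block.hit(2)
    print(block.temporal)
```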

The detection unit is a small fully-associative cache, where the locality bits 
of evicted blocks are saved. In our implementation, only non-temporal blocks are 
stored in the detection unit's cache, since blocks that miss in the detection unit 
are assumed to have temporal locality. 

In the allocation by conflict (ABC) policy [13], blocks are locked in a direct 
mapped cache until the number of misses exceeds the number of references to 
this block. The ABC policy adds a conflict bit to each block in module A. The 
conflict bit is set to zero when a reference is made to the associated block. It is 
set to one on each cache miss which maps into the same A-set. In each module, 
there is a candidate block for replacement, selected by LRU policies. One of 
these blocks is then selected using the conflict bit of the block in module A. 
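
The text does not spell out how the conflict bit chooses between the two candidates, so the sketch below assumes one plausible reading: the module A candidate is evicted only if its conflict bit was already set by an earlier miss, and otherwise the module B candidate is evicted. The function names and the dictionary representation are hypothetical.

```python
def abc_on_reference(block_a):
    """A reference to a block in module A clears its conflict bit."""
    block_a["conflict"] = 0


def abc_choose_module(candidate_a):
    """Return "A" or "B": the module whose LRU candidate is evicted on a miss.

    Assumed reading: a set bit means a previous miss already hit this A-set
    without an intervening reference, so the A block is no longer protected.
    """
    evict_from_a = candidate_a["conflict"] == 1
    candidate_a["conflict"] = 1     # every miss mapping to this A-set sets the bit
    return "A" if evict_from_a else "B"


if __name__ == "__main__":
    blk = {"conflict": 0}
    print(abc_choose_module(blk), abc_choose_module(blk))
```

Under this reading a block in module A survives one conflicting miss per reference, which matches the description of blocks being locked until misses outnumber references.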

3.3 Locality-Sensitive Replacement Policies for Skewed- and 
Set-Associative Caches 

A locality-sensitive replacement policy has to know the locality properties of the 
data in the cache, so we label each block with its type of locality. We use the 
not recently used (NRU) policy to account for some aging effect. The policies 
work similarly to NRU. They define several categories of blocks and search them 
in their listed order. 

The temporality-based NRU policy (TNRU) is a combination of NRU and 
NTS. The locality properties of the data are defined and detected in exactly the 
same way as in NTS. The recency ordering between cache blocks is maintained 
using the NRU policy. The TNRU policy can distinguish between four categories 
of blocks: old and non-temporal, old and temporal, young and non-temporal and 
young and temporal. The temporal properties of the block are decided using the 
locality information at the time of loading the block in the cache. 

The second replacement policy we propose is the conflict-based NRU policy 
(CNRU), based on the ABC policy. Each block in the cache has a conflict 
bit, which is managed in the same way as in ABC. The CNRU policy distinguishes 
between four categories of cache blocks, namely those that are old and 
not proven, young and not proven, old and proven and young and proven. A 
block is proven when its conflict bit is one. 
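
Both TNRU and CNRU search their categories in the listed order, which can be sketched with one shared routine. The random choice within a category is an assumption carried over from NRU; the label strings and function name are illustrative.

```python
import random

# Categories in the order in which each policy lists (and searches) them.
TNRU_ORDER = [("old", "non-temporal"), ("old", "temporal"),
              ("young", "non-temporal"), ("young", "temporal")]
CNRU_ORDER = [("old", "not proven"), ("young", "not proven"),
              ("old", "proven"), ("young", "proven")]


def pick_victim(blocks, order, rng=random):
    """Return the index of a victim among `blocks`.

    blocks: list of (age, locality) labels, one per candidate frame.
    order: TNRU_ORDER or CNRU_ORDER.
    """
    for category in order:
        pool = [i for i, label in enumerate(blocks) if label == category]
        if pool:
            return rng.choice(pool)   # assumed: random choice within a category
    raise ValueError("no candidate blocks")


if __name__ == "__main__":
    frames = [("young", "non-temporal"), ("old", "temporal")]
    print(pick_victim(frames, TNRU_ORDER))
```

Note that the two orders differ in structure: TNRU ranks age first, whereas CNRU evicts every unproven block before any proven one.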

4 Experimental Evaluation 

We evaluated the performance of the replacement policies in three cache organ- 
isations: a multi-module cache, a skewed-associative cache and a set-associative 
cache. All caches have 32 byte blocks and use demand fetching. The skewed- 
associative and set-associative cache are both 8 kB large and have associativity 
two. The skewed-associative cache uses the index functions defined in [10]. 

The multi-module cache was chosen such that the modules have approximately 
the same cycle time. We used the CACTI model [14] to obtain access times 
of several cache organisations in a 0.13 µm technology. A fully associative 1 kB 
module has a 1.3 ns cycle time. To match this cycle time, the other module 
should be either 8 kB (1.23 ns) or 16 kB (1.32 ns) large and 2-way set-associative. 
Another possibility is to use a very large direct mapped cache, e.g. a 64 kB cache 
(1.36 ns). We choose to combine an 8 kB 2-way set-associative cache with a 1 kB 
fully associative cache. 



Table 1. Miss ratios of the LRU policy in the different cache organisations 

SPECfp        MM     SA     SK    SPECint       MM     SA     SK
applu      0.071  0.074  0.074    compress   0.045  0.051  0.049
apsi       0.048  0.085  0.040    gcc        0.044  0.063  0.048
fpppp      0.012  0.024  0.023    go         0.010  0.032  0.020
hydro2d    0.179  0.178  0.180    ijpeg      0.021  0.043  0.024
mgrid      0.054  0.054  0.055    li         0.038  0.045  0.041
su2cor     0.049  0.051  0.051    m88ksim    0.011  0.021  0.013
swim       0.080  0.609  0.091    perl       0.015  0.035  0.027
tomcatv    0.153  0.453  0.151    vortex     0.022  0.045  0.028
turb3d     0.061  0.083  0.050
wave5      0.164  0.316  0.167

We collected traces of all SPEC95 benchmarks using ATOM [12]. The number 
of memory references in each trace was limited to 300 million, taken from the 
middle of the program. 

Miss ratios are used as the performance measure. However, since miss ratios 
vary greatly from program to program, the miss ratios are divided by the miss 
ratio of the LRU policy in the same cache organisation. Table 1 contains the miss 
ratios of the LRU policies in the different cache organisations, for reference. MM 
is the multi-module cache, SA is the set-associative cache and SK is the skewed- 
associative cache. 
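
The normalisation can be written out as a small helper. The LRU value is the tomcatv entry for the multi-module cache in Table 1; the policy miss ratio used in the example is a hypothetical number for illustration only.

```python
def relative_miss_ratio(policy_miss_ratio, lru_miss_ratio):
    """Miss ratio of a policy divided by that of LRU in the same organisation."""
    return policy_miss_ratio / lru_miss_ratio


def percent_change(relative):
    """Express a relative miss ratio as a percentage change with respect to LRU."""
    return 100.0 * (relative - 1.0)


if __name__ == "__main__":
    lru_mm_tomcatv = 0.153      # Table 1, multi-module cache
    hypothetical_policy = 0.160 # made-up value, not a result from the paper
    rel = relative_miss_ratio(hypothetical_policy, lru_mm_tomcatv)
    print(round(rel, 3), round(percent_change(rel), 1))
```

A relative miss ratio above 1 means the policy misses more often than LRU; below 1 it misses less often.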
5 Discussion of the Results 

For the multi-module cache, the most eye-catching result is that the locality- 
based replacement policies have very bad performance for some benchmarks 
(Figure 2). In one case, the miss ratio is increased by 175% (ABC for the 
benchmark tomcatv). In other cases, the miss ratio of a benchmark can be in- 
creased by as much as 5 or 10%. The replacement policies usually have a miss 
ratio that is worse than that of the LRU policy. For the SPECfp benchmarks, the 
ENRU and CNRU policies perform the best. These policies also closely follow 
the miss ratio of the LRU policy. 





Fig. 2. Relative performance of replacement policies in the 8 kB two-way set- 
associative and 1 kB fully-associative multi-module cache for SPECfp (left) and 
SPECint (right) 



For the SPECint benchmarks, the CNRU and ABC policies provide the 
best results. In contrast to what happens for the SPECfp benchmarks, all re- 
placement policies sometimes perform 10 to 20% worse than the LRU policy. 
However, this unexpected behaviour is not as bad as it is for the SPECfp bench- 
marks. Furthermore, the CNRU policy generally works better than the ABC 
policy, on which it is based. The same relation holds between TNRU and NTS. 

Figure 3 shows the results for the set-associative cache. The ENRU policy has 
about the same performance as the LRU policy, while it has a larger cost. For 
some benchmarks, the TNRU policy provides a big improvement with respect to 
the LRU policy (e.g. 13% for fpppp and 8.6% for perl). On average, TNRU 
performs about 1% better than LRU for SPECint and SPECfp. 

In the skewed-associative cache, the ENRU policy usually performs worse 
than the LRU policy, except for fpppp (Figure 4). This was also reported in [10]. 
The locality-sensitive policies perform really well for the benchmark fpppp and 
they also work better than the ENRU policy for most benchmarks. Overall, the 
miss ratio of TNRU is about 2% lower than that of ENRU, both for SPECint 
and SPECfp. However, neither CNRU nor TNRU performs better on average 
than the LRU policy. 
Fig. 3. Relative performance of replacement policies in the 8 kB two-way set-associative 
cache, for SPECfp (left) and SPECint (right) 
Fig. 4. Relative performance of replacement policies in the 8 kB skewed-associative 
cache, for SPECfp (left) and SPECint (right) 



6 Related Work 

Many different locality-sensitive replacement policies and accompanying cache 
organisations have been proposed. The NTS cache was introduced in [4] and was 
slightly changed in [5]. The main difference between these two versions is the way 
the locality properties of evicted blocks are remembered. Our implementation is 
based on [5]. 

The dual data cache [2] dedicates one module to data with high temporal lo- 
cality, while the other module caches data with high spatial locality. The locality 
properties are detected by treating all fetched data as vectors. The stride and 
vector length are measured and used to define three types of locality: non-vector 
data, short vectors and self-interfering vectors. The latter type is not cached at 
all. The speedup of the dual data cache is largely due to selective caching [2]. 
Alternatively, a compiler can detect the stride and vector length, as well as self- 
and group-reuse [6,7]. 
Several processors have implemented multi-module caches. The data cache of 
the HP PA-RISC 7200 consists of a large direct mapped cache and a small fully 
associative cache [1]. The purpose of the fully associative cache is to decrease 
the number of conflicts in the direct mapped cache. 

Another approach is taken in the UltraSPARC III, which has multi-module 
L1 and L2 data caches [3]. These caches are managed by splitting the reference 
stream, not on the basis of locality properties, but on the origin of the transfers 
between the caches. 

7 Conclusions 

We discussed the problems associated with applying locality-sensitive replace- 
ment policies to set-associative and skewed-associative caches. We extended two 
replacement policies from literature by labelling each block with its locality type. 

We compared the locality-sensitive replacement policies to recency-based 
replacement policies like LRU and ENRU. Overall, we find that the locality-sensitive 
replacement policies have approximately the same performance as recency-based 
policies. Recency-based replacement policies can manage a multi-module 
cache as well as or better than locality-sensitive policies. Furthermore, 
the locality-based policies are mostly suited for the SPECfp benchmarks, although 
they show very poor behaviour for some benchmarks. In contrast, the 
recency-based policies are better behaved. 

For set-associative caches, the locality-based replacement policy TNRU decreases 
the miss ratio slightly relative to LRU (by about 1%). In a skewed-associative 
cache, the TNRU policy provides a 2% improvement over the ENRU policy. 
Abstract. Recent cache research has mainly focussed on how to split 
the first-level data cache. This paper concentrates on the redesign of the 
filter data cache scheme, presenting some improvements on the first 
version of the scheme. A performance study compares these proposals 
with other organizations that split caches according to the criterion of 
data localities. The new filter cache schemes exhibit better performance 
than the other compared solutions. An 18 KB organization offers 
a block management capacity equivalent to a conventional 28 KB 
cache. 



1 Introduction 

Recent research [1-12] has focused on optimizing the first-level (L1) data cache organization 
in order to increase the L1 hit ratio and reduce this critical time. Usually, 
the proposed models classify the data lines in two independent sets, according to a 
predefined characteristic exhibited by the data. To improve performance, both types of 
data are then cached and treated separately in caches with independent organizations. 
For this purpose, the L1 cache is usually split into two parallel caches, also called 
subcaches because both make up the first level and each caches one type of predefined 
data line. The main advantage of having two independent subcaches is that it is possi- 
ble to tune each specific organization (cache size, associativity, and block size), and 
replacement algorithm, according to the characteristics of the data. The criterion of 
data locality prevails among the schemes that split the data cache. More information 
on this subject can be found in references [17,18,19]. 
The filter data cache [6] is a scheme that splits the first level data cache into two 
subcaches; the smaller cache filters the most strongly referenced blocks, and is connected 
by a unidirectional data path to the larger cache. The datapath is used to move 
blocks to the larger subcache when a block is replaced from the smaller subcache. 

This paper presents new approaches for the filter data cache scheme, and compares 
performance with two recent schemes [3,4]. 



2 Existing Solutions 

Two earlier proposals for handling spatial and temporal localities in separate 
caches were introduced in [1, 2]; however, both localities can appear together, or not 
appear at all. In the STS scheme [3] the cache is split by giving priority to temporal 
locality, and the lines exhibiting some temporal locality are cached together in a large 
organization, while the lines that do not exhibit temporal locality are cached sepa- 
rately. In [4] the other extreme is performed by giving priority to the spatial locality. 
In [5] the first level data cache is split into three organizations, one caches the lines 
exhibiting temporal and spatial localities together, another caches lines exhibiting only 
spatial locality, and the final organization only caches lines showing temporal locality. 

Schemes not designed to exploit data localities in independent caches have also 
been proposed. The Assist cache [7] tries to reduce the conflict misses by adding a 
small fully associative cache. The Victim cache [8] tries to retain the most recent 
conflict lines in a small cache between the first level and the second level of the memory 
hierarchy. The Allocation By Conflict scheme [9] tries to take replacement decisions 
based on the behavior of the conflict block allocated in the "main subcache". To 
avoid introducing pollution into the cache, some schemes propose bypassing the cache 
lines that are infrequently referenced [10,11,13]. Other schemes propose caching them 
in a small bypass buffer [11]. In [12], the data cache is split according to the type of 
data: scalar or array. 

In those schemes managing reuse information, a line in the first level cache uses a 
hardware mechanism to gather information about the behavior of a block in cache 
(current information). When the line is replaced from the first level cache, this infor- 
mation is flushed to the L2 cache (or to another structure in the first level), and then 
used when the block is again referenced (reuse information) to decide in which first 
level cache the line must be placed. In general, the schemes have two caches at the 
first level, the larger one, or "main cache", and a smaller cache that usually works as 
an assistant to improve performance. The reuse information is reset (lost) when a line 
is removed from the second level cache. In addition, some schemes introduce a data- 
path connecting both first-level caches. 



2.1 Non Temporal Streaming Cache (NTS) 

In the NTS cache [3] proposed by Rivers et al. the data is dynamically tagged as tem- 
poral or non-temporal. The model shows a large temporal cache placed in parallel 
with a small non-temporal cache. Each line in the temporal cache has a reference bit 
array attached, in addition to a non-temporal (NT) bit. When a block is placed in the 
temporal cache, each bit in the reference bit array is reset and the NT bit is set. When 
a hit occurs in this cache, the bit associated with the accessed word is set. If the bit 
was already set (meaning that the word had already been accessed), the NT bit is reset 
to indicate that the line shows temporal behavior. When a line is removed from the first 
level cache its NT bit flushes to the second level cache. If the line is referenced again, 
this bit is checked to decide where it must be placed. 



2.2 The Split Spatial/Non-Spatial (SS/NS) 

The SS/NS cache was proposed by Prvulovic et al. [4]. This scheme makes a division 
between spatial and non-spatial data lines, giving priority to lines exhibiting spatial 
locality. The model introduces a large spatial cache in parallel with a non-spatial 
cache that is four times smaller. The spatial cache exploits both types of spatial local- 
ity (only spatial, or both spatial and temporal). Line size in the non-spatial cache is 
just one word; thus, only temporal locality can be exploited in this cache. In the spatial 
cache, the line size is larger (four words). 

The spatial cache uses a prefetch mechanism to assist this type of locality. A hard- 
ware mechanism is introduced to recompose lines in the non-spatial cache and move 
them (by a unidirectional data path) to the spatial cache. It uses a reference bit array 
similar to the one incorporated in the NTS scheme to tag lines, which are tagged as 
spatial if more than two bits are set; otherwise, they are tagged as non-spatial. 



2.3 The Filter Data Cache 

The filter data cache was introduced in a previous paper [6]. The model presents a 
very small direct-mapped “filter” cache in addition to a large “main cache” in the first 
level. The scheme tries to identify the most heavily referenced lines and places them 
together in the small “filter cache”. 

Each cache line has a 4-bit attached counter showing the number of times that each 
cache line is referenced. When the access results in a hit in any subcache, the counter 
is increased. If the access results in a miss in both subcaches, the counter of the refer- 
enced line is compared with the counter of the conflict line in the filter cache to decide 
in which subcache the referenced line will be placed. If the counter of the referenced 
line is less than that of the conflict line in the filter cache, then the miss line is placed in the main 
cache. Otherwise, the model assumes that the miss line is more likely to be referenced 
again than the conflict line, and so it is placed in the filter cache. As the lines in the 
filter cache have shown a high reference frequency, when they are evicted from it they 
move over a unidirectional datapath to the main cache to spend more time in the 
first level. The counter value (four bits) is the only information flushed to the L2 
cache when lines are evicted from the first level cache. 
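
The placement decision of the original filter data cache can be sketched as a counter comparison. The function names are illustrative, and saturating the 4-bit counter at 15 is an assumption, since the text does not say what happens on overflow.

```python
COUNTER_MAX = 15  # largest value of a 4-bit counter; saturation is assumed


def bump(counter):
    """Increase a line's reference counter on a hit in either subcache."""
    return min(counter + 1, COUNTER_MAX)


def place_on_miss(referenced_counter, conflict_counter):
    """Return the subcache ("main" or "filter") that receives the missing line.

    referenced_counter: counter of the line being brought in.
    conflict_counter: counter of the line it would displace in the filter cache.
    """
    if referenced_counter < conflict_counter:
        return "main"
    return "filter"   # ties go to the filter cache, following the "otherwise" case


if __name__ == "__main__":
    print(place_on_miss(2, 5), place_on_miss(5, 2))
```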

In this work, we add associativity to the small filter cache. When a conflict occurs,
the line with the lowest counter in the set is compared against the counter of the
referenced block. The result of the comparison decides in which subcache the referenced
block must be placed. To prevent blocks with low or no temporal locality from moving to
the filter cache, a minimum counter threshold is established. A threshold greater than
zero means that the default cache is the "main cache". From time to time, each line
counter is shifted to ensure that lines with good temporal locality during just one
phase of their execution do not remain in the filter cache when their temporal
locality drops.
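
The placement rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulator: the names `Line`, `record_hit`, `age` and `choose_subcache` are hypothetical, the saturating increment and the one-bit right shift are assumptions (the text only says the counter is "increased" and "shifted"), and the filter-cache set is assumed to be full when the miss occurs.

```python
from dataclasses import dataclass

COUNTER_BITS = 4                       # counter width stated in the text
COUNTER_MAX = (1 << COUNTER_BITS) - 1  # 15


@dataclass
class Line:
    tag: int
    counter: int = 0


def record_hit(line: Line) -> None:
    """Increment on a hit; saturating at COUNTER_MAX is an assumption."""
    line.counter = min(line.counter + 1, COUNTER_MAX)


def age(line: Line) -> None:
    """Periodic aging; a one-bit right shift is an assumption."""
    line.counter >>= 1


def choose_subcache(missed_counter: int, filter_set: list, threshold: int = 0) -> str:
    """Return "filter" or "main" for a line that missed in both subcaches.

    filter_set holds the lines of the (full) filter-cache set that the
    missed line maps to; missed_counter is the counter of the missed line.
    """
    # Lines below the minimum threshold always go to the main cache.
    if missed_counter < threshold:
        return "main"
    # Compare against the line with the lowest counter in the set.
    victim = min(filter_set, key=lambda line: line.counter)
    return "filter" if missed_counter >= victim.counter else "main"
```

With a threshold of zero, a line whose counter ties with the lowest counter in the set goes to the filter cache, which matches the "otherwise" branch of the rule in the text.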

If an application has many blocks showing high temporal locality, the blocks will be
quickly replaced from the filter cache and end up in the main cache. In such specific
situations, the effectiveness of the filter is very poor. To improve performance in
such cases, a bidirectional datapath connecting the first-level caches is introduced
to allow line swapping. The result of adding this feature is a different model called
the Filter Cache Swap. For the performance evaluation study we consider two approaches,
one of them incorporating the swapping mechanism.



3 Assumptions and Conditions of the Analysis 

All organizations in the cache hierarchy are two-way set associative with a block size
of 256 bits (eight 32-bit words); the "main cache" capacity is 16 KB and the "small
cache" capacity is 2 KB. Due to traditional mapping function restrictions, cache
capacities must be a power of two. Thus, if we wish to compare the performance of the
proposed organizations, which have an 18 KB capacity (16 KB plus 2 KB), with that of
conventional ones, we have several options. One commonly adopted solution is to compare
against a 16 KB conventional cache, which has a similar capacity [9,16]. In this paper,
we estimate the theoretically equivalent capacity of a conventional cache offering the
same performance. To do this, we assume that performance behaves linearly between 16 KB
and 32 KB, and we estimate the capacity at which the performance of each scheme would
fall. Thus, we compare the splitting data cache schemes with one another and present
results with a conventional 16 KB cache used as the baseline scheme, as well as a
larger 32 KB conventional organization. The L2 cache is 256 KB. A data bus as wide as
a cache line is assumed between the first-level and second-level caches.
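
As a quick check of the geometry implied by these parameters, the following sketch computes the number of lines and sets of each first-level subcache. The function name `geometry` is hypothetical; the line size, associativity and capacities are the ones stated in the text.

```python
LINE_BYTES = 32   # 256-bit block (eight 32-bit words)
WAYS = 2          # two-way set associative


def geometry(capacity_bytes: int) -> tuple:
    """Return (lines, sets) for a cache of the given capacity."""
    lines = capacity_bytes // LINE_BYTES
    return lines, lines // WAYS


main_lines, main_sets = geometry(16 * 1024)      # main cache
filter_lines, filter_sets = geometry(2 * 1024)   # small filter cache
total_lines = main_lines + filter_lines          # lines in the 18 KB organization
```

The 18 KB organization therefore holds 512 + 64 = 576 lines, against 1024 lines for a 32 KB conventional cache.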



4 Simulation Analysis 

4.1 Experimental Framework and Benchmarks 

Performance results have been obtained using the execution-driven simulator
LIMES [15] and several suites of benchmarks. We selected five SPLASH-2 benchmarks
(FFT, Radix, FMM, LU and Barnes), the Compress benchmark from the SPEC suite, and
the two benchmarks (MM and Jacobi) discussed in [15]. The problem size in all the
selected benchmarks exceeded 60M memory references, except for the Compress
benchmark, which was run using training data inputs.

The Tour of a Line is defined [16] as the interval of time from when the line is
placed in the first-level cache until the line is evicted from that level. The number
of tours and their lengths, measured as the mean number of accesses that hit the line
while it is in a tour, are used to evaluate the effectiveness of cache data management.
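
To make the tour metric concrete, the sketch below counts tours and their mean length from a simplified event trace. The trace format (`"fill"`, `"hit"`, `"evict"` events) and the function name `tour_stats` are hypothetical, and two details are assumptions: the access that fills the line is not counted as a hit, and tours still open at the end of the trace are counted.

```python
def tour_stats(events):
    """events: iterable of (kind, line) pairs, kind in {"fill", "hit", "evict"}.

    Returns (number_of_tours, mean_hits_per_tour).
    """
    open_hits = {}   # line -> hits so far in its current tour
    finished = []    # hits per completed tour
    for kind, line in events:
        if kind == "fill":
            open_hits[line] = 0            # a new tour starts
        elif kind == "hit":
            open_hits[line] += 1
        elif kind == "evict":
            finished.append(open_hits.pop(line))   # the tour ends
    finished.extend(open_hits.values())    # tours still open at the end
    tours = len(finished)
    mean = sum(finished) / tours if tours else 0.0
    return tours, mean
```

A scheme that manages the first level well shows fewer tours with more hits per tour for the same reference stream.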



4.2 Hit Ratio and Tour Analysis 

Table 1 shows the L1 miss ratio of the split data cache schemes, the 16 KB baseline
conventional cache and the larger 32 KB cache. We are not interested in comparing
the performance of the data splitting cache schemes with that of the larger classic
cache, which has nearly twice their capacity. We only wish to show that the performance
obtained by the splitting schemes is closer to that of the large traditional cache than
to that of the small, similarly sized, traditional cache.

All the data splitting cache models improve on the miss ratio of the conventional 16
KB cache, except the SS/NS model in the FMM benchmark. The filter scheme gives
the best performance among the splitting data cache models, except for the Jacobi
application, where the NTS model performs better. This result is explained by the fact
that, while the data lines in this application exhibit high temporal locality, they also
have bad tour persistence. In other words, some lines that are tagged as non-temporal
at the end of a tour start their next tour in the non-temporal cache, and then
unexpectedly exhibit temporal locality. Consequently, despite this undesirable behavior,
the NTS scheme makes excellent use of both subcaches in Jacobi.



Table 1. Miss ratio (%) of the schemes with an 8-word block size

Benchmark    16 KB   FILTER   FILTERswap      NTS    SS/NS    32 KB
Barnes        2.42     1.65         1.76     1.86     2.11     0.98
FMM           0.79     0.63         0.65     0.71     1.18     0.51
LU            1.23     0.93         1.08     1.19     1.23     1.15
FFT           3.61     3.59         3.60     3.52     3.60     3.17
RADIX         1.29     1.29         1.29     1.29     1.29     1.29
MM            10.1     10.1         9.94     10.1    10.07     10.1
Jacobi        18.7     13.3        13.54     4.40    18.73     18.7
Compress      3.12     2.91         2.92     3.01     3.03     2.17
Average       3.23     3.01         3.03     3.10     3.22     2.76



The results also show that better miss ratios are achieved in the filter scheme if the
swapping mechanism is disabled, with the single exception of the MM application.
The differences in the Radix kernel among the schemes are negligible because of the
high data locality exhibited by this benchmark. The miss ratio of the filter scheme
is not only the best among the splitting models, but is also better in LU and Jacobi
than that of the larger conventional cache with nearly twice its capacity.

The lower the number of tours, the more effective a cache scheme is.
Table 2 shows the number of tours in the 16 KB conventional cache, and the percentage
reduction in tours offered by the other schemes. A negative value means that
the number of tours increases with respect to the 16 KB conventional cache.

The results show that the filter scheme is the 18 KB organization that most reduces
the number of tours, showing significant improvements in four of the eight benchmarks
used (Barnes, FMM, LU, and Compress). Its reduction sometimes doubles that
obtained by the NTS and SS/NS schemes; furthermore, in some cases (LU and Jacobi)
its reduction is better than that of the large classic cache. The only exception
appears in Jacobi, where the NTS scheme shows better results than the filter scheme.

From the data in Table 2 we estimated the theoretically equivalent conventional cache
capacity as explained before, and show these results in Table 3. The filter cache
scheme with 18 KB offers tour reductions equivalent to those offered by a theoretical
conventional cache with a capacity of 28 KB. The filter swap and the NTS schemes also
offer good results, equivalent to a conventional cache with a capacity of 24 KB. Poorer
performance is offered by the SS/NS scheme, which only just reaches the equivalent of
a 19 KB conventional cache. We have omitted the Jacobi values when estimating the
average because they are very large and show an unusual behavior.
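
The paper does not spell out the interpolation formula. One reading that is consistent with the linearity assumption of Section 3, and that reproduces several entries of Table 3 from Table 2, is sketched below; the function name `equivalent_capacity_kb` is hypothetical and the formula itself is an assumption.

```python
def equivalent_capacity_kb(reduction_scheme, reduction_32kb,
                           small_kb=16.0, large_kb=32.0):
    """Linear interpolation between the 16 KB and 32 KB conventional caches.

    reduction_scheme: % reduction in tours of the split scheme (Table 2).
    reduction_32kb:   % reduction in tours of the 32 KB cache (Table 2).
    The 16 KB baseline corresponds to a reduction of 0 %.
    """
    fraction = reduction_scheme / reduction_32kb
    return small_kb + (large_kb - small_kb) * fraction


barnes_filter = equivalent_capacity_kb(31.93, 59.46)   # Barnes, FILTER
fmm_ssns = equivalent_capacity_kb(-49.49, 35.28)       # FMM, SS/NS
lu_filter = equivalent_capacity_kb(7.74, 3.75)         # LU, FILTER
```

Rounded to the nearest KB, these three values give 25, -6 and 49, the same figures listed for those cells in Table 3. The formula also shows why Jacobi yields very large values: the 32 KB cache reduces its tours by only 0.15 %, so the denominator is close to zero.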



Table 2. Tours in the 16 KB classic cache and reduction in tours (%) offered by the
other schemes

            # Tours   ------------- % Reduction in Tours -------------
Benchmark     16 KB   FILTER   FILTER swap      NTS    SS/NS    32 KB
Barnes      2015920    31.93         27.09    23.01    13.00    59.46
FMM          887092    20.46         17.06    10.17   -49.49    35.28
LU          2353847     7.74          4.45     3.48     0.23     3.75
FFT         1433600     2.13         -2.94     2.34     0.13    12.27
RADIX        516246     0.22         -0.07     0.20     0.30     0.26
MM         17072741     0.55          0.57     0.57     0.55     0.58
Jacobi     16967214    29.15         27.77    76.54     0.10     0.15
Compress     270003     6.61          6.46     3.46     2.89    30.55
Average     5189583    12.35         10.05    14.97    -4.04    17.79


Table 3. Theoretically equivalent conventional cache capacities (KB)

Benchmark   FILTER  Filter swap     NTS   SS/NS
Barnes          25           23      22      19
FMM             25           24      21      -6
LU              49           35      31      17
FFT             19           16      19      16
RADIX           30           16      28      35
MM              31           32      32      31
Jacobi        3093         2947    8095      27
Compress        19           19      18      18
Average         28           24      24      19

5 Hardware Cost

To calculate the hardware cost in bits in the first level of the cache hierarchy, all
organizations are assumed to be two-way set associative with a line size of 32 bytes.

The hardware cost of the filter cache schemes is approximately 14% greater
than that incurred by the 16 KB conventional cache. On the other hand, the cost
incurred by the 32 KB cache is about 74% greater than that of the filter schemes. In
summary, the filter data caches, with only 576 lines (512 plus 64), improve performance
over organizations with a similar capacity, and sometimes even surpass the performance
offered by organizations using 1024 lines.



6 Conclusions 

In this paper two new improvements of the filter data cache scheme have been presented.
We have evaluated the performance of these schemes and compared them with two other
schemes that have recently appeared in the literature and that split the cache
according to the criterion of data locality (the NTS and SS/NS).

In this initial study, we chose hit ratio and tour management as performance in- 
dexes that are independent of the memory access time. The unidirectional datapath 
between the first-level caches is used to prolong the time that a heavily referenced line 
spends at that level. In this sense, the results show that the proposed schemes offer 
better hit ratio and tour management than those splitting the cache according to the 
criterion of the data localities. The filter scheme offers tour management that equals a 
theoretically equivalent conventional cache with a capacity of 28 KB. In some cases, a 
cache with just 576 blocks (18 KB) improves the tour management offered by the 
classic cache with 1024 blocks (32 KB). 

Data localities are continuously changing, so lines sometimes offer a bad tour per- 
sistency; and this reduces the efficacy of those schemes that split caches according to 
the criterion of data localities. Their performances consequently drop. This problem 
does not appear in filter schemes that place most referenced lines in a small cache. In 
these schemes, a bad persistency implies that lines exhibiting bad tour persistency will 
be quickly replaced from the filter cache. In addition, we shift the counters to ensure 
that lines which dynamically change their localities do not continue residing in the 
filter cache. 

Abstract. We present numerical results of three-dimensional global
magneto-hydrodynamic (MHD) simulations carried out on the Astrophysical
Rotating Plasma Simulator (ARPS) developed at Chiba University.
We simulate the time evolution of differentially rotating disks by using
a parallelized three-dimensional MHD code. The typical number of
grid points is $(N_r, N_\varphi, N_z) = (200, 64, 240)$ in a cylindrical
coordinate system. We found that when the initial magnetic field is toroidal
and relatively strong, the system approaches a quasi-steady state with
$\beta = P_{gas}/P_{mag} \sim 5$. When the disk is threaded by vertical magnetic
fields, a magnetically driven collimated jet emanates from the surface of
the disk. Fully vector-parallelized global simulations with ARPS enable
us to study non-local effects such as magnetic pinch, saturation of the
nonlinear growth of instability, and deformation of the global structure.



1 Introduction 

Numerical simulation is a fundamental tool for investigating active phenomena in
astrophysical objects because they occur under extreme conditions which laboratory
experiments cannot mimic. Rotation and magnetic fields often play
essential roles in such phenomena.

When matter with angular momentum falls in from the interstellar medium
or from a companion star, it spirals in and forms a rotating disk around the
gravitating object. The matter inside the disk gradually accretes by losing angular
momentum. Such a disk is called an accretion disk. In conventional theories
of accretion disks (see, e.g., Shakura and Sunyaev 1973), the turbulent viscosity
and the Maxwell stress exerted by turbulent magnetic fields are postulated to
transport angular momentum outward and to drive accretion.

The gravitational energy released through the accretion process is believed
to be the origin of activity in X-ray binaries, dwarf novae, and active galactic
nuclei. When a rotating disk is threaded by large-scale poloidal magnetic fields,
centrifugal force and magnetic pressure drive outflows (see, e.g., Blandford and
Payne 1982; Uchida and Shibata 1985).

Since three-dimensional effects are essential to the dynamo process and turbulence
in accretion disks, we need to carry out 3D simulations.



Fig. 1. The conceptual design of astrophysical rotating plasma simulator. Diagram
labels: User Interface (web-based browser); Parameter Setting (initial model,
boundary condition, add-on physics); 3D Graphics Image; ARPS Platform



To meet these needs of the astrophysical community, we are developing an astrophysical
rotating plasma simulator (ARPS) with which we can carry out global three-dimensional
magneto-hydrodynamic (MHD) simulations of rotating plasma.
Fig. 1 shows the concept of ARPS. It consists of modules for initial model set-up,
a mesh generator, a time integrator based on finite differencing (the engine), add-on
sub-modules which incorporate various physics (e.g., resistivity, radiative cooling,
heat conduction, self-gravity), and a visualizer. A web-based user interface (Fig. 2)
enables users to set up the initial model and boundary conditions, and to monitor
the progress of their simulation. The sub-modules which incorporate various
physics share the same data structure and can be plugged into the platform of
the simulator. Each module is parallelized by using the MPI library.

2 Importance of Global Simulations for the Study of 
Accretion Disks 

Conventionally, the viscosity inside the accretion disk is described by the
phenomenological parameter $\alpha$, i.e., the off-diagonal component of the stress
tensor is assumed to be proportional to the pressure ($t_{r\varphi} = -\alpha P$). By
comparing theory and observation, $\alpha$ is estimated to be $\alpha \sim 0.02$.
Molecular viscosity cannot account for such a high value of $\alpha$ (Cannizzo et al.
1988). Balbus and Hawley (1991) pointed out the importance of the local
magneto-rotational (or Balbus & Hawley) instability, which generates turbulence in
accretion disks and enhances the angular momentum transport rate. When a differentially
rotating plasma is threaded by magnetic fields (Fig. 3), the instability grows if the
radial force created by transporting angular momentum outward is larger than
the restoring magnetic tension. This magneto-rotational instability grows even
when the magnetic field is very weak. The maximum growth rate is of the order of
the angular velocity of the disk. Azimuthal magnetic fields are also subject to
the non-axisymmetric magneto-rotational instability (Balbus and Hawley 1992).

Fig. 2. The user interface based on web browser

Three-dimensional local MHD simulations of the nonlinear growth of the
magneto-rotational instability have been carried out by several authors (Hawley
et al. 1995, Matsumoto and Tajima 1995, and Brandenburg et al. 1995). It turned
out, however, that the growth rate and the saturation level of the instability
depend on the size of the simulation box and the global structure of the magnetic
field. Therefore, global simulation is required in order to evaluate the actual value
of $\alpha$. Moreover, global simulation is essential to simulate global phenomena
such as jet formation and collimation.

Fig. 3. The mechanism of magneto-rotational instability



3 Three-Dimensional MHD Simulations of Accretion Disks

As an initial model for the disk, we assume that a rotating polytropic
($\ln P \propto \ln \rho$) torus with constant angular momentum distribution
$L = L_0$ is threaded by toroidal magnetic fields, and that the torus is embedded in
a spherical, non-rotating isothermal halo (see Fig. 4). In a cylindrical coordinate
system $(r, \varphi, z)$, the dynamical equilibrium of the torus is obtained by assuming



where $R$ is the distance from the center, $v_s^2$ is the square of the sound speed,
and $\gamma$ is the specific heat ratio. We take the radius where the rotation
velocity is equal to the Keplerian rotation velocity ($v = v_{K0}$) as the reference
radius $r_0$. The constant $\Psi_0$ is given at the reference radius, i.e.,
$\Psi_0 = \Psi(r_0, 0)$ (Okada et al. 1989). We also normalize the density and other
variables at this radius, i.e., $r_0 = v_{K0} = \rho_0 = 1$ at $r = r_0$. A model
parameter of the torus is $E_{th} = (v_{s0}/v_{K0})^2/\gamma$, where $v_{s0}$ is the
sound speed at the reference radius. We take $E_{th} = 0.05$. The halo parameters
are $E_h = (v_{sh}/v_{K0})^2/\gamma$ and $\rho_h/\rho_0$, where $v_{sh}$ and $\rho_h$
are the sound speed and the density in the halo at $(r, z) = (0, r_0)$, respectively.
We adopt $E_h = 1.0$ and $\rho_h/\rho_0$ = 10“^.

We solved the ideal MHD equations in a cylindrical coordinate system by using a
modified Lax-Wendroff scheme (Rubin and Burstein 1967) with artificial viscosity
(Richtmyer and Morton 1967). The ideal MHD equations incorporated in
ARPS are as follows:





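$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0, \qquad (1)$$

$$\rho \left[ \frac{\partial \mathbf{v}}{\partial t} + (\mathbf{v} \cdot \nabla) \mathbf{v} \right] = -\nabla P - \rho \nabla \psi + \frac{(\nabla \times \mathbf{B}) \times \mathbf{B}}{4\pi}, \qquad (2)$$

$$\frac{\partial \mathbf{B}}{\partial t} = \nabla \times (\mathbf{v} \times \mathbf{B}), \qquad (3)$$

$$\frac{d}{dt} \left( \frac{P}{\rho^{\gamma}} \right) = 0, \qquad (4)$$

where $d/dt = \partial/\partial t + \mathbf{v} \cdot \nabla$, $\psi$ is the
gravitational potential, and $\rho$, $P$, $\mathbf{v}$, $\mathbf{B}$, and $\gamma$
are the density, pressure, velocity, magnetic field and specific heat ratio,
respectively. We neglect the molecular viscosity for simplicity. The gravitational
field is assumed to be given by a central point mass $M$.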

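Using these equations, the equation of energy conservation follows:

$$\frac{\partial}{\partial t} \left( \frac{1}{2} \rho v^2 + \frac{B^2}{8\pi} + \frac{P}{\gamma - 1} \right) + \nabla \cdot \left[ \left( \frac{1}{2} \rho v^2 + \frac{\gamma}{\gamma - 1} P \right) \mathbf{v} + \frac{\mathbf{E} \times \mathbf{B}}{4\pi} \right] = -\rho \mathbf{v} \cdot \nabla \psi, \qquad (5)$$

where $\mathbf{E} = -\mathbf{v} \times \mathbf{B}$ is the electric field. The time
evolution of the disk is computed by solving equations (1), (2), (3), and (5).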

The number of grid points used in the simulations is
$(N_r, N_\varphi, N_z) = (200, 64, 240)$. We simulated only the upper half space
($z > 0$) and assumed that, with respect to the equatorial plane, $\rho$, $v_r$,
$v_\varphi$, $B_r$, $B_\varphi$, and $P$ are symmetric and $v_z$ and $B_z$ are
antisymmetric. The outer boundaries at $r = r_{max}$ and $z = z_{max}$
are free boundaries through which waves can be transmitted. In order to avoid the
singularity at $R = 0$, we softened the gravitational potential near the gravitating
center ($R < 0.2 r_0$).




Fig. 4. (left) Initial model of a torus threaded by toroidal magnetic fields
($\beta_0 = 1$). The radius at which the gravitational force equals the centrifugal
force in the torus is $r = r_0$. (right) Density distribution and magnetic structure
at $t = 6.2 t_0$. Solid curves show magnetic field lines at the equatorial plane and
the gray scale shows the density distribution (box size is $9r_0 \times 9r_0 \times 9r_0$)

The initial Lorentz force in the torus is assumed to be in equilibrium with the
gravitational force, centrifugal force, and gas pressure gradient. The magnetic
pressure is assumed to be equal to the gas pressure, $\beta_0 = P_{gas}/P_{mag} = 1$
at $r = r_0$. Figure 4 shows the initial model of this simulation. The solid curves
show magnetic field lines and the grey scale shows the density distribution.

We added a 1 percent ($0.01 v_\varphi$) random perturbation to the azimuthal velocity
at $t = 0$ and followed the evolution (Machida et al. 2000). The revolution time
is measured in units of the rotation period at the reference radius
($t_0 = 2\pi r_0/v_{K0}$). When the magneto-rotational instability grows after several
revolutions, magnetic turbulence developing in the disk tangles the magnetic field
lines (Fig. 5). As angular momentum is efficiently transported outward, the torus
becomes flattened.




Fig. 5. Global structure of the magnetic field (box size is
$9r_0 \times 9r_0 \times 9r_0$). The isosurface of $\beta$ is shown in grey scale.
The dark grey region shows the isosurface of $\beta = 0.1$. Magnetic loops can
clearly be seen rising from the disk



Figure 5 shows the magnetic structure of the accretion disk after 6.2 revolutions.
The dark grey surface encloses the region where the magnetic field is strongly
enhanced ($\beta < 0.1$). As in the solar corona, magnetic loops buoyantly rise from
the disk. We can observe magnetic loops elongated in the azimuthal direction.

When an accretion disk is threaded by large-scale open magnetic fields, a
magnetically driven bipolar jet emanates from the disk (Uchida and Shibata 1985).

Figure 6 shows a result of a 3D MHD simulation of a torus initially threaded
by global vertical magnetic fields. The model parameters are the same as those
in the toroidal field model, though the initial plasma $\beta$ at
$(r, z) = (0, r_0)$ is $\beta = 2$. Small-amplitude non-axisymmetric perturbations
are imposed on the azimuthal velocity. Within one rotation period, a magnetically
driven jet emanates from the torus. Although helical structure appears in the jet
owing to the growth of non-axisymmetric instabilities in the disk, jets are not
disrupted by the non-axisymmetric instability. In order to demonstrate that the
magnetic field is wiggled on surfaces of sharp density gradient, the gradient of the
density distribution on a logarithmic scale is shown in grey scale in Figure 6. After
2 revolutions, non-axisymmetric structure appears in the jet.

Fig. 6. (left) Numerical results of 3D global MHD simulations of jet formation
from a torus threaded by vertical magnetic fields at the stage of half a revolution.
Grey scale shows the gradient of the density distribution. Curves show magnetic field
lines. (right) Density distribution and magnetic fields at the stage of 2 revolutions

The numerical results can be compared with the high resolution VLBI (Very 
Long Baseline Interferometry) observations of jets in active galactic nuclei. 

4 Summary 

Global three-dimensional simulations of accretion disks reproduced various phenomena
in accretion disks such as efficient angular momentum transport and jet
formation. When the torus is threaded by weak magnetic fields, the magnetic fields
are amplified by the magneto-rotational instability. After the amplification of
magnetic energy saturates at $\beta \sim 10$, the system approaches a quasi-steady
state. The numerically obtained value of the angular momentum transport parameter
($\alpha \sim 0.01$ to $0.1$) is consistent with observations. Inside the disk,
filament-shaped regions locally dominated by magnetic pressure appear. Magnetic energy
release in strongly magnetized regions can explain the violent X-ray time variability
characteristic of black hole candidates.

The global simulation of the accretion disks needs 200,000 time steps for 13
revolutions at $r = r_0$, and it takes about 24 hours on 15 CPUs of the VPP300/16R at
NAOJ. For visualization and analysis of the data, we store the numerical data every
2,000 time steps. The total data size is 10.8 Gbyte per model simulation.
Three-dimensional analyses are done with the 3D graphics software AVS, which is
added to the visual interface of ARPS.

The vectorization efficiency of ARPS has already reached 98%, and the
parallel performance increases almost linearly with the number of CPUs,
attaining 90% efficiency when 15 CPUs of the VPP300 at the National Astronomical
Observatory are used (13.4 times faster than 1 CPU).
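
The quoted figures can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope illustration; the function name `parallel_efficiency` is hypothetical, and the snapshot count assumes that the initial state is not stored separately.

```python
def parallel_efficiency(speedup: float, n_cpus: int) -> float:
    """Parallel efficiency = speedup relative to one CPU / number of CPUs."""
    return speedup / n_cpus


efficiency = parallel_efficiency(13.4, 15)   # roughly 0.89, i.e. about 90%

# Output volume: one snapshot every 2,000 of the 200,000 time steps.
snapshots = 200_000 // 2_000
gbyte_per_snapshot = 10.8 / snapshots        # roughly 0.108 Gbyte per snapshot
```

A speedup of 13.4 on 15 CPUs thus corresponds to an efficiency just under 90%, in line with the value stated in the text.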

ARPS has successfully demonstrated its capability to investigate accretion
disks by direct numerical simulation through a user-friendly interface. With this
simulator, one can simulate various astrophysical objects such as X-ray binaries,
quasars, active galactic nuclei, etc.

Work is in progress to install a more powerful numerical engine and to enable
remote operation over a network.

Abstract. Physical systems are usually characterized by different coupled
parameters accounting for the interaction of their different components.
The Cellular Automata Network (CAN) model [1] allows each component of a
physical system to be represented in terms of Cellular Automata
(CA) [9], and the interaction among these components in terms of CA
networks. In this paper we report our experiments in exploiting two
different kinds of parallelism offered by the CAN model using policies for
network restructuring and thread assignment. For this purpose we used a
prototype graphic tool (CANviz) designed to let the user experiment with
heuristics to efficiently exploit two-level parallelism in CAN applications.



1 Introduction 

Complex systems that evolve according to local interactions of their constituent
parts can be simulated through Cellular Automata (CA) programming [9].
Applications of CA are very broad, ranging from the simulation of fluid dynamics
to physical, chemical, and geological processes.

In order to capture the microscopic and macroscopic aspects involved in the
simulation of physical phenomena in a uniform representation, we used the Cellular
Automata Network (CAN) model [1]. The CAN model allows each component of a
physical system to be represented in terms of cellular automata, and the interactions
among these components in terms of CA networks. The CAN model potentially offers
two different kinds of parallelism: data parallelism, coming from the local
interactions of each cell composing the cellular automaton only with its neighborhood,
and control parallelism, coming from the CA network execution model.

In this paper we report our experience in exploiting both levels of parallelism
and how developers can drive multi-level parallelism exploitation. In section 2
the CAN model is discussed; section 3 describes the extensions of the PECANS
environment [3] used to build applications in the CAN Language (CANL) [2]; section 4
reports our experimental results obtained by exploiting parallelism on a target
parallel architecture. Finally, some conclusions and future work are reported.

* Partly supported by CNR Project: “Sviluppo di una Modellistica Sperimentale 
Spazio-Temporale di Processi Evolutivi dell’Ambiente per la Mitigazione dei Rischi”. 

2 The Cellular Automata Network Model

By their nature as systems with discrete space dimensions and discrete time 
evolution, CA are useful for simulating spatio-temporal phenomena providing a 
framework for a large class of discrete models with homogeneous interactions. 

A Cellular Automaton [9] consists of a regular discrete lattice of cells, with 
a discrete variable at each cell assuming a finite set of states; cells are updated 
synchronously in discrete time steps according to a local, identical interaction 
rule. According to this rule, referred to as the cellular automaton transition 
function, the state of each cell evolves in time and its evolution depends on the 
state, at the previous time step, both of the cell itself and of a finite number of 
neighbor cells. The neighborhood of a cell consists of the surrounding adjacent 
cells according to a specified topology. 
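
The definition above can be illustrated with a minimal synchronous update step. This sketch is not CANL or PECANS code: the function names `step` and `parity_rule`, the von Neumann (four-neighbor) topology, the periodic boundaries and the parity rule are all assumptions chosen only to make the example small.

```python
def step(grid, rule):
    """Apply one synchronous update to a 2D grid of cell states.

    Every new state is computed from the previous time step only, using the
    cell itself and its four adjacent cells (periodic boundaries).
    """
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            neighbors = [
                grid[(i - 1) % rows][j],
                grid[(i + 1) % rows][j],
                grid[i][(j - 1) % cols],
                grid[i][(j + 1) % cols],
            ]
            new[i][j] = rule(grid[i][j], neighbors)
    return new


def parity_rule(state, neighbors):
    """Example transition function: parity of the cell and its neighbors."""
    return (state + sum(neighbors)) % 2


start = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
after_one_step = step(start, parity_rule)
```

Because each cell reads only the previous grid, all cells can be updated independently, which is the source of the data parallelism discussed in this paper.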

The CA model we used in our work, called the Cellular Automata Network
(CAN) model [1], extends the model described above with the abstraction of a network
of cellular automata. The CAN model can be applied when a model of a complex
physical phenomenon can be constructed by means of a reduction process in which the
main model components are identified through an abstraction mechanism and the
interactions among components can be identified. The CAN model thus makes it possible
to simulate a two-level evolutionary process in which the local cellular interaction
rules evolve together with cellular automata connections. In this way, global
information-processing capabilities that are not explicitly represented in the
elementary network components or in their interconnections can be obtained.

In the CAN model an automaton is denoted by a name, and its behavior is described
by a set of properties, a transition function, and a neighborhood type.
A property can correspond either to a physical property of the system to be
simulated, such as temperature or volume, or to some other feature of
the system, such as the probability that a particle moves. In either case,
according to the standard CA model, each property corresponds to a computational
grid. In this schema a cell of an automaton is considered as a functional
composition of the cells of the automaton's properties. A necessary requirement
is that the cells of the property grids correspond to one another.

According to the CAN model, when the physical system components are partially
coupled it is necessary to introduce a network of cellular automata, i.e. to define
a set of automata and specify a dependence relation among them as follows:

Definition 1. Let A and B be two automata; if one or more properties of the
automaton A are used inside the transition function of the automaton B, then we
say that B depends on A, i.e. A δ B.

The CA network dependence relations can be represented by a directed acyclic
graph, called the CAN dependence graph, whose nodes represent the cellular
automata of the network and whose arcs represent the dependence relation between
two nodes. The CA network dependence relations impose “precedences” among
the executions of the network automata, so a precedence relation graph can be
obtained from the dependence relation graph. In fact, let $A = \{a_1, \ldots, a_n\}$
be the set of cellular automata composing a network; we define on this set the
“precedence” relation $\prec$, that is, $a_i \prec a_j$ if and only if $a_i$ must be
completed before $a_j$ can start execution; then, if $a_i \nprec a_j$ and
$a_j \nprec a_i$, $a_i$ and $a_j$ can be concurrently executed. The couple
$(A, \prec)$ is a partially ordered set.

In our computational model, $a_i$ has to be executed before $a_j$,
i.e. $a_i \prec a_j$, if and only if there is a set of automata
$\{a_1, \ldots, a_n\}$ such that $a_1 = a_i$, $a_n = a_j$, and each pair of consecutive
automata $a_{l-1}$, $a_l$ in the set is related by the dependence relation $\delta$.
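
The step from the dependence graph to the precedence relation can be sketched as a transitive closure. The code below is an illustration only, not part of PECANS or CANL: the function names and the four-automaton network are hypothetical, and the input is assumed to be acyclic, as the text requires.

```python
def predecessors(depends_on):
    """depends_on maps each automaton to the automata it directly depends on.

    Returns, for every automaton, the set of automata that must complete
    before it can start (the transitive closure of the dependences).
    """
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes |= set(deps)
    closure = {}

    def collect(node):
        if node not in closure:
            result = set()
            for dep in depends_on.get(node, ()):
                result.add(dep)
                result |= collect(dep)
            closure[node] = result
        return closure[node]

    for node in nodes:
        collect(node)
    return closure


def concurrent_pairs(depends_on):
    """Pairs of automata with no precedence in either direction."""
    pred = predecessors(depends_on)
    names = sorted(pred)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if a not in pred[b] and b not in pred[a]]


# Hypothetical network: B and C depend on A; D depends on B and C.
network = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
pairs = concurrent_pairs(network)
```

In this example only B and C are unrelated by precedence, so they are the only automata that can be executed concurrently; this is the control parallelism exposed by the network.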

3 The Extended PECANS Environment 

Applications designed according to the CAN model are written using the Cellular 
Automata Network Language (CANL) [2], specifically designed to express the 
CAN model components. The language provides a set of primitives to define 
both a cellular automaton, together with all its features, and a network of cellular 
automata explicitly declaring the dependence relations that occur in the network. 

We used the PECANS environment [3] to write CANL applications and execute
them on specific architectures that can be sequential or parallel. In PECANS
a CANL program is written, cross-compiled into the C language and then
linked to the run-time environment of the architecture on which the CANL
application must be executed. The comparison of PECANS with other programming
environments is reported in [11].

In computer simulation, performance is a crucial requirement. The CA approach has
proved to be a good candidate to meet this requirement
since it makes it possible to exploit the data parallelism intrinsic to the CA
programming model, which comes from the local interactions of each cell composing the
cellular automaton only with its neighborhood. So, the standard CA programming model
maps quite naturally onto a SIMD execution model. Another source of parallelism in
CAN applications is the control parallelism deriving from the concurrent execution of
automata among which no precedence relations occur.

We extended PECANS with a graphic tool (CANviz) for visualizing the CAN
program structure and controlling parallel code generation. CANviz is
implemented in Tcl/Tk [10, 7] using a library, named TclDot [5], that adds
graph manipulation facilities to Tcl/Tk.

CANviz visualizes the CANL program code and the dependence graph
representation of the CA network, showing the correspondence between CANL
statements and CA network nodes and arcs. The main purpose of CANviz is to let
the user control the amount of parallelism to be exploited in CANL programs
when multiple levels of parallelism are available. The tool allows the user to
graphically apply a network restructuring policy while guaranteeing that the
resulting precedence relation graph preserves program correctness.

CANviz highlights the nodes of the CA network that are candidates for data
parallelism exploitation, allowing the user to enable or disable it. Control
parallelism is visible to the user in the precedence graph structure, since all
the branches in the graph represent tasks that can be concurrently executed,
while branches reaching a node represent a control synchronization point.
Finally, an additional feature of CANviz allows weights to be associated with
the cellular automata, to be taken into account when deciding the amount of
parallelism to be exploited in case of multi-level parallelism.

Fig. 1. The CAN Visualization Tool

4 Experimental Results for Multi-Level Parallelism 

CA networks can be executed on a parallel computer by exploiting either data
parallelism alone or both control and data parallelism, the latter resulting in
a multi-level parallel application. Policies are clearly necessary to decide
the amount of parallelism to be spawned at each level in order to better
exploit the available parallel computer resources [8].

In our implementation only the outer level of control parallelism can be 
exploited: branches belonging to paths of the network deriving from previous 
branches are serialized in the execution. Therefore, once network execution forks 
on a branch, only data parallelism can be exploited. 

In order to gain more information on the real advantages of using the
multi-level parallelism potentially offered by the CAN model, we ran
experiments on the SGI Origin2000 multiprocessor computer [6]. It is a
cache-coherent NUMA multiprocessor with 6 dual-processor basic node boards.
Each node is equipped with two 195 MHz MIPS R10000 processors, each with a
64 KByte primary cache and a 4 MByte secondary cache. Each node has 4 GBytes of
DRAM memory. The SGI Origin2000 uses the IRIX operating system, version 6.5.4.
We chose to use the IEEE POSIX threads package [4] to implement the two-level
parallelism in a nested way.

Fig. 2. The 4-branch test program

In order to compare the performance, in terms of execution time, obtained with
the one-level and the two-level parallelism approaches while varying both the
problem size and the number of CPUs used, we designed a synthetic CAN
application in which the automata of the network represent computational tasks
with equal workloads.

In our experiments we deal with both balanced and unbalanced CA networks, 
where unbalancing occurs when the branches of the network have different costs 
in terms of execution time. So unbalanced networks can result when branches 
contain a different number of automata, and/or automata with different work- 
loads, and/or serialized automata (i.e. automata for which no data parallelism is 
possible due to the use of global variables). In our test application, unbalancing 
derives only from the different number of automata on network branches. 
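One simple way to quantify the imbalance described above is to compare the most expensive branch with the mean branch cost. This is an illustrative sketch only; the function name `branch_imbalance` and the metric itself are assumptions, not something defined in the paper. Branch costs are expressed as the number of equal-workload automata on each branch, as in the test application.

```c
#include <assert.h>

/* Ratio between the cost of the most expensive branch and the mean branch
 * cost: 1.0 for a balanced network, larger values mean more imbalance. */
double branch_imbalance(const int *branch_cost, int branches)
{
    assert(branches > 0);
    int max = branch_cost[0];
    long total = 0;
    for (int b = 0; b < branches; b++) {
        assert(branch_cost[b] > 0);
        if (branch_cost[b] > max)
            max = branch_cost[b];
        total += branch_cost[b];
    }
    return (double)max * branches / (double)total;
}
```

Four branches with one automaton each give 1.0, while three branches holding 1, 2 and 1 automata give 1.5: the longest branch takes 50% more time than it would if the work were evenly spread.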

The objective of our experiments was to obtain preliminary results that help
identify policies for deciding which parallel execution to adopt according to
the network configuration and the parallel machine resources.

The experiments were organized in two sets. For the first set, we compared 
speedups obtained exploiting multi-level parallelism versus one-level data par- 
allelism approach. Therefore, we chose an application consisting of a balanced 
network of 6 automata whose precedence relations are shown in figure 2. 

The second set of experiments was conceived to study the effects on perfor- 
mances of multi-level parallelization of unbalanced CA networks. In this case 
the application consists of the same set of automata as the previous ones, but 
with different dependence relations that generate unbalancing as shown in fig- 
ure 3(a). Note that, from an execution time point of view, the sequential and the 
data parallel versions of the latter network are the same as the previous case. 

In the precedence relation graph of this network, it is possible to add some 
precedence relations (network restructuring policy) without affecting the pro- 
gram correctness due to the precedence relation transitivity. In fact, introducing 
unnecessary precedence relations in our example, as shown in figure 3(c), a dif- 
ferent synchronization point in the network parallel execution results. 

In all experiments we measured application speedup as the ratio of the
sequential execution time to the CAN multithreaded execution time.

Fig. 3. (a) The 3-branch test program with (b) threads remapping and (c) restructured network



In each experiment the number of threads allocated for parallelism (for both
levels) is fixed and threads are bound to processors, so the number of
generated threads represents the actual number of CPUs allocated to the
application. We chose different resource assignment policies (thread
assignment policy), and for each choice the problem size was varied by
squaring the automata grid, starting from 64x64 up to 2048x2048 elements for
each property.

In both sets of experiments the data parallel version is obtained by executing
all network automata in the sequential order specified by the user; the
available threads execute the transition function on portions of the automaton
grid obtained by dividing each property along rows into equal-sized chunks.

For the first set of experiments the results are reported in figure 4(a). When
exploiting the two levels of parallelism, the thread allocation policy is to
assign one thread to each branch of the CA network and, within each branch, to
allocate and uniformly distribute the remaining threads for data parallel
processing.

First we notice that, except for small problem sizes (up to 128x128), both the
two-level and the one-level parallelism approaches show speedups that increase
with the number of threads for a fixed problem size, and the two-level approach
performs better than the one-level approach. In the one-level approach all
threads are assigned to each automaton to exploit data parallelism, so for
small problem sizes each concurrent thread receives little work and thread
creation and management overheads become significant. Exploiting the control
parallelism level as well means assigning the computations of different
automata to different threads. For a fixed number of available threads, the
data chunks of each thread are therefore larger than those used with data
parallelism alone. For fixed resources, the two approaches show similar
performance as the problem size increases, because the thread management and
synchronization overheads become less significant.

For the second set of experiments, the results are reported in figure 4(b),
where data parallelism speedups are compared to those obtained exploiting two
levels of parallelism with three different parallel executions. In the first
case (unbalanced network) each branch spawns, in turn, the same number of
threads for data parallelism, although the branch in the middle has twice as
much work to do (see figure 3(a)). In the second case (restructured network)
automaton 4 has been shifted out of the middle branch into the main network
branch (see figure 3(c)), obtaining a balanced network where threads can be
equally distributed among the network branches. In the third case (threads
remapping) a different number of threads is assigned to each network branch
according to its cost (see figure 3(b)).

Fig. 4. (a) 1-level versus 2-level parallelism in the 4-branch test program; (b) 2-level parallelism solutions in the 3-branch test program
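The threads remapping case can be illustrated with a small allocation routine. This is a sketch under stated assumptions, not the allocation code used in the experiments: the function name `remap_threads`, the rule that every branch keeps at least one thread, and the tie-breaking order for leftover threads are all hypothetical.

```c
#include <assert.h>

/* Assigns `threads` threads to `n` branches roughly in proportion to cost[],
 * giving every branch at least one thread. The result is written to out[]. */
void remap_threads(const int *cost, int n, int threads, int *out)
{
    assert(n > 0 && threads >= n);
    long total = 0;
    for (int b = 0; b < n; b++) {
        assert(cost[b] > 0);
        total += cost[b];
    }

    int spare = threads - n;   /* threads left after one per branch */
    int used = 0;
    for (int b = 0; b < n; b++) {
        out[b] = 1 + (int)((long)spare * cost[b] / total);
        used += out[b];
    }

    /* Hand leftover threads to the branch with the highest cost per thread. */
    while (used < threads) {
        int best = 0;
        for (int b = 1; b < n; b++)
            if ((long)cost[b] * out[best] > (long)cost[best] * out[b])
                best = b;
        out[best]++;
        used++;
    }
}
```

With branch costs 1, 2 and 1 (the middle branch having twice as much work), 8 threads are split as 2, 4 and 2, and 12 threads as 3, 6 and 3, so each thread ends up with a comparable share of the work.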

First consider the speedups of the three approaches for small problem sizes
(e.g. 128x128). Only with six threads does the unbalanced network perform
worse than the other cases. As the number of threads increases, the three
approaches give similar speedups, which are low because of the small grain
size of the computations carried out by each thread.

Moreover, for this problem size, the restructuring and the thread remapping
policies bring no advantage for a greater number of threads because of the way
loop iterations are distributed over the threads: the property row size is
divided by the number of threads used, and the remainder of the division is
assigned to the last thread in the pool. This extra computation introduces an
imbalance that is more costly when small properties are mapped to an
increasing number of threads.
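The chunking rule described above, and the imbalance it causes, can be made concrete with a short sketch. The type `row_chunk` and both function names are hypothetical. `chunk_remainder_last` follows the rule in the text; `chunk_balanced` is a common alternative shown only for contrast and is not something the paper reports using.

```c
#include <assert.h>

typedef struct {
    int first;   /* index of the first row of the chunk */
    int count;   /* number of rows in the chunk */
} row_chunk;

/* Rule from the text: every thread gets rows / threads rows and the last
 * thread in the pool additionally takes the remainder of the division. */
row_chunk chunk_remainder_last(int rows, int threads, int t)
{
    assert(rows >= 0 && threads > 0 && t >= 0 && t < threads);
    int base = rows / threads;
    row_chunk c = { t * base, base };
    if (t == threads - 1)
        c.count += rows % threads;
    return c;
}

/* Alternative for contrast: the remainder is spread one row at a time over
 * the first threads, so chunk sizes differ by at most one row. */
row_chunk chunk_balanced(int rows, int threads, int t)
{
    assert(rows >= 0 && threads > 0 && t >= 0 && t < threads);
    int base = rows / threads;
    int rem = rows % threads;
    row_chunk c;
    c.first = t * base + (t < rem ? t : rem);
    c.count = base + (t < rem ? 1 : 0);
    return c;
}
```

For 128 rows and 6 threads, the first rule gives the first five threads 21 rows each and the last thread 23 rows, whereas the balanced variant yields chunks of 22 or 21 rows.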

For medium and large problem sizes (e.g. 512x512 and 2048x2048) speedups
increase with the number of threads because the grain size is larger and
parallelism gains outweigh thread management overheads. Moreover, the
imbalance due to the remainder of the iterations left over by the loop chunk
definition has a relevant effect only when using 6 threads, since the
remainder is greater than in the other cases. This effect makes the
restructuring policy perform better than the thread remapping one, while for a
greater number of threads the two policies are equivalent.

5 Conclusions and Future Works

In this paper we presented our preliminary results obtained exploiting the two- 
level parallelism potentially offered by the CAN model. As we showed in the 
experimental results, when exploiting multi-level parallelism, CA network re- 
structuring and thread assignment policies have to be adopted to better use the 
available parallel computational resources of the target machine. These policies 
can be driven by the user through the CANviz tool. 

We plan to carry out more experiments on real CAN applications to study the
feasibility of extracting heuristics that could be adopted in choosing a
network configuration and a thread scheduling policy for better parallelism
exploitation. In this way it will be possible to reduce user interaction by
automating the restructuring of the CA network and the selection of the
configuration that best matches the available parallel resources. For this
purpose we plan to build a new module (based on the CANviz tool) in the PECANS
environment to provide a CA parallel programming environment that, with
limited user interaction, produces parallel code, so that users of the
environment need no particular skills in parallel computing.

Abstract. Because of irregular and dynamic data structures, parallel
programming in non-numerical fields often requires asynchronous messages whose
number is not known in advance. Such programs are hard to write using
MPI/Pthreads, and many new parallel languages, designed to hide messages under
the runtime system, suffer from execution overhead. Thus, we propose a
parallel programming language, Orgel, that enables brief and efficient
programming. An Orgel program is a set of agents connected with abstract
channels called streams. The stream connections and messages are declaratively
specified, which prevents bugs due to the parallelization and also enables
effective optimization. The computation in each agent is described in a usual
sequential language, thus efficient execution is possible.

The result of evaluation shows the overhead of concurrent switching 
and communication in Orgel is only 1.2 and 4.3 times larger than that 
of Pthreads, respectively. In the parallel execution, we obtained 6.5-10 
times speedup with 11-13 processors. 



1 Introduction 

To obtain high performance using parallel machines, a means for brief and
efficient programming is necessary. Especially in non-numerical processing,
existing programming methods require low-level specifications or suffer from
large runtime overhead.

So, we propose a new parallel programming language called Orgel, which has 
both abstract description of parallelism and runtime efficiency. The program- 
ming paradigm of Orgel is multi-agents connected with abstract communication 
channels called streams. The agents run in parallel, passing messages via streams. 

The computation of each agent is described in a usual sequential language, so
the execution of each agent is efficient. The connections among agents and
streams are declaratively described. This feature statically determines the
parallel model of the program, and thus prevents bugs in communications and
enables strong optimization using static analysis.

This paper is organized as follows: Section 2 describes the background.
Sections 3 and 4 present the language design of Orgel and the current
implementation. Section 5 shows the result of the evaluation, and Section 6
gives the conclusion.

2 Background 

For non-numerical processing, automatic parallelization is extremely difficult.
Dynamic and irregular structures like lists and trees cannot be statically
divided, and the program structure is also irregular because of recursive
calls. Therefore many parallel programming methods have been proposed for this
field.

One of the major methods is to use message passing libraries (PVM [1],
MPI [2]) or thread libraries (Pthreads [3]) with a sequential language like C
or Fortran. This approach is easy to learn for users accustomed to sequential
programming languages, and efficient programming is possible through low-level
tuning. However, the order and number of communications are nondeterministic
in many non-numerical programs. In such cases, mismatches between
corresponding sends and receives easily occur if low-level communications are
explicitly specified.

Another method is to design a new programming language with parallel execution
semantics (KL1 [4]). Parallelism can be naturally described in such languages,
and owing to the abstraction of communications and synchronizations, the user
can avoid timing bugs. However, such abstraction causes large overhead at
runtime and leads to inefficiency.

To reduce such overhead, optimization schemes using static analysis have been
proposed. For example, our optimization scheme for KL1 achieved remarkable
speedup for typical cases [5, 6]. However, precise static analysis of dynamic
behavior is difficult, so the optimization is ineffectual in some cases.

So, a desirable parallel programming language should have the following
features: 1) efficiency and a style similar to usual sequential programming,
and 2) abstract specification of parallelism and communications to reduce the
burden on users. The specification should also help static optimization.

3 Language Design 

3.1 Language Overview 

Orgel is designed for non-numerical programming. Because automatic paral- 
lelization is difficult in this field, Orgel leaves the specification of parallelism and 
communications to the user, and supplies frameworks for such specification. 

The execution unit of Orgel is called an agent. We also introduce an abstract 
message channel called a stream, for the inter-agent communication. Thus, the 
Orgel program is represented as a set of agents connected by streams. The agents 
run in parallel, passing messages via streams. 

The syntax of Orgel is based on C. We added 1) declarations of
stream/agent/message/network connection; 2) statements for message
creation/transmission/dereference and agent termination; and 3) agent member
functions. We also eliminated global variables and added agent member
variables. Thus each function can be coded efficiently in the usual sequential
programming style.

    stream StreamType [inherits StreamType1 [, ...]] {
        MessageType [(Type Arg: Mode [, ...])];
    };

Fig. 1. Stream type declaration

As described in Sections 3.2 and 3.3, the structure and behavior of streams
and agents are defined as stream types and agent types. The instances of
agents and streams are automatically created at runtime, according to the
variable definitions of stream types or agent types. They are automatically
connected according to the connection declaration (see Section 3.4).

When an Orgel program is executed, a main agent is created and starts its
execution. If the main agent type contains variable definitions of
agent/stream types, their instances are also automatically created, streams
are connected to the agents, and the agents start execution in parallel. Thus,
the network of agents and streams is built without operational creation or
connection.

Compared with many other multi-agent/object-oriented languages [7, 8], this
declarative specification of the network makes the parallel execution model of
the program clear. It prevents the creation of unexpected network structures,
and also enables precise static analysis, which leads to effective
optimization.

3.2 Stream 

A stream is an abstract message channel, based on KL1’s stream communication
model [9]. Our stream has a direction of message flow, and one or more agents
can be connected to each end.

A stream type is declared in the form of Fig. 1. This declaration enumerates 
message types that a stream type StreamType accepts. An Orgel message type 
takes the form of a function: MessageType which is used as the message identifier, 
and a list of arguments with types and input/output mode (in/out). 

3.3 Agent 

An agent is an active execution unit, which exchanges messages with other
agents while performing its computation.

To create an agent, an agent type declaration in the form of Fig. 2 is needed. 
The form of agent type declaration is similar to C function declaration. The 
arguments of an agent type are its input/output streams with types and modes. 

Member functions are defined in the same way as usual C functions, except
that the function name is specified in the form AgentType::FunctionName.

(Footnote: For efficient execution, the creation of instances is delayed until
a message is sent to them.)

    agent AgentType([StreamType StreamName: Mode [, ...]]) {
        member function prototype declarations
        member variable declarations
        connection declarations
        initial Initializer;
        final Finalizer;
        task TaskHandler;
        dispatch (StreamName) {
            MessageType: MessageHandler,
        };
    };

Fig. 2. Agent type declaration

For member variables, independent memory areas are allocated to each agent 
instance. The scope of member variables is within the agent type declaration and 
all member functions of the agent type. 

Agents and streams are logically created by the definition of member variables 
of the declared types. In this paper, we call these variables agent variables and 
stream variables. As explained in Section 3.1, physical creation is automatic 
without operational creation. 

Connection declarations specify how to connect agents and streams defined 
as member variables. We will show details of this declaration in Section 3.4. 

The last four elements (initial, final, task, dispatch) define event handlers.
The handlers are sequential C code, with extensions for message handling.
Initializer and Finalizer are executed on the creation and destruction of the
agent, respectively. TaskHandler defines the agent’s own computation, and is
executed when the agent is not handling messages. The dispatch declaration
specifies the message handler for each message type of input stream
StreamName, by enumerating acceptable message types MessageType and handler
code MessageHandler. This declaration works as a framework for asynchronous
message receiving.

3.4 Connection Declaration 

The connection among agents and streams can be specified by a connection 
declaration in the following form: 

connect [Agent0.S0 Dir0] Stream Dir1 Agent1.S1;

The declaration takes a stream variable Stream and input/output streams S0,
S1 of agent variables Agent0, Agent1. The specifier self can be used in place
of an agent variable, to connect the agent that contains the connection
declaration. Dir0 and Dir1 are direction specifiers (==> or <==) that indicate
the message flow.

If the agent/stream variable is an array, an array specifier in the form
[SubscriptExpression] is needed. If the subscript expression is a constant
expression, the declaration argument means an element of the array. If the
expression is omitted, the argument means all elements of the array.

Fig. 3. Example of connection declarations; panel (b) shows the network of agents

The subscript expression may contain one identifier called a pseudo-variable.
It works as a variable whose scope is within the connection declaration, and
represents every integer value for which no subscript expression exceeds the
array size.

By using array specifiers, a set of one-to-one connections, or a
one-to-many/many-to-one connection, can be declared. A message to a stream
with multiple receivers is multicast to them, and messages to a stream from
multiple senders are nondeterministically serialized.

An example is shown in Fig. 3(a). Here we assume that worker is an agent type,
and comm and broadcast are stream types. In the first connection declaration,
each subscript expression is restricted to the size of arrays w and right.
Thus the value of pseudo-variable i ranges over 0, ..., 14, and as shown in
Fig. 3(b), the output stream ro of each agent is connected to the input stream
li of its right neighbor agent, via each right stream. By the third
declaration, the sender side of stream b is connected to the agent main, and
the receiver side is connected to every worker type agent. Thus this stream
works as a broadcast network from main to all worker agents.

Because the connect declaration statically defines the network model, the
compiler can perform precise static analysis and can optimize scheduling and
communications. Such optimization is much more difficult with the operational
stream connection in A’UM [10] or AYA [11]. Candidate Type Architecture [12]
also offers declarative network configuration, but our stream model is more
flexible.

3.5 Message 

Message Variables Using message types declared in the stream type declara- 
tion, variables for messages can be defined. We call them message variables. 

A message variable acts as a logical variable in logic programming languages.
Its initial state is unbound, and changes to bound when it is assigned a
message object or another message variable. The bound state has two cases: if
the variable is assigned a message object, the state is instantiated; and if
the variable is assigned another variable that is not instantiated, the state
is uninstantiated. In the latter case, a later assignment of a message object
changes the state of every related variable to instantiated.

    stream command {
        process(int n: in, map m: in, result r: out);
        map(char dat[32]: in);
        result(int answer: in);
    }

    process p;
    result r;
    int i;
    char d[32];

    p = process(i, map(d), r);
    s <== p;

Fig. 4. Sending Messages. Annotations in the figure: (1) create message
objects; (2) copy data.

    char d[32];
    dispatch (s) {
        process(i, m, r): {
            m ?== map(d);
            ...
            r = result(j);
            ...

Fig. 5. Receiving Messages. Annotations in the figure: (1) receive a message;
(2) copy data on dispatch; (3) copy data on dereference; (4) instantiate an
out-moded argument; (5) send back an out-moded argument.



Creating and Sending Messages To create a message object, the message type
and actual arguments are written in a functional form. For example, if a
stream type is declared as shown in Fig. 4, a message variable can be defined
and assigned a message object as shown in Fig. 4(1).

The type of a message argument must be either a C data type other than a
pointer type, or a message type.

In the former case, the actual argument value is stored in the message
object. If the argument type is an array, the actual argument should be a
pointer to the head of array data of the declared size. In the example of
Fig. 4(2), arguments of type int and array of char are stored in the message.

In the latter case, the mode can be either in or out. If the mode is in, the
argument is instantiated by the message sender. The actual argument must be a
message object of the declared argument type. It can be an uninstantiated
variable at message creation, but must eventually be instantiated by the
message sender. If the mode is out, the argument is instantiated by the
message receiver. The actual argument must be an uninstantiated message
variable.

The created message object is sent to a stream by a send statement of the 
following form: 

Stream <== Message ; 



Receiving Messages An agent connected to a stream’s receiver side receives
and handles messages according to the dispatch declaration of the agent type.
A dispatch declaration specifies one of the agent’s input streams, and
enumerates acceptable message types. When a message of one of these types is
received, the argument values of the message object are stored in the
corresponding variables specified in dispatch. As in message creation, the
variable corresponding to an array argument is regarded as a pointer, and an
area of the array size is copied.

If the message has out moded arguments, the corresponding variables will be 
uninstantiated variables. These variables are bound to the sender’s corresponding 
variables, and by instantiating receiver’s variables with message objects, the 
objects are sent back to the sender. 

A message that appears as an argument of another message can be obtained by a
dereference expression of the following form:

Variable ?== MessageType[(Arg1[, ...])];

If the message variable Variable is uninstantiated, the agent executing this
expression is suspended, and is resumed when another agent instantiates the
variable. If the message’s type is MessageType, the expression returns
non-zero and assigns the corresponding argument values to Arg1, .... If the
type differs, it returns zero.
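The type test performed by a dereference expression can be pictured in plain C as a tagged-union check. This is only an illustrative sketch: the names `message`, `deref_map` and `deref_result` are hypothetical, the message layout is not the one generated by the Orgel compiler, and the suspension of the agent on an uninstantiated variable is left out (the message is assumed to be instantiated already). The map and result message types are taken from the stream declaration of Fig. 4.

```c
#include <assert.h>
#include <string.h>

enum msg_type { MSG_MAP, MSG_RESULT };

/* A message object tagged with its type (hypothetical layout). */
typedef struct {
    enum msg_type type;
    union {
        char dat[32];   /* argument of map(char dat[32]: in) */
        int answer;     /* argument of result(int answer: in) */
    } arg;
} message;

/* Models `m ?== map(d)`: when m is a map message, copies its array argument
 * into d and returns non-zero; otherwise leaves d untouched and returns 0. */
int deref_map(const message *m, char d[32])
{
    assert(m != NULL);
    if (m->type != MSG_MAP)
        return 0;
    memcpy(d, m->arg.dat, sizeof m->arg.dat);
    return 1;
}

/* Models `m ?== result(a)` in the same way for the result message type. */
int deref_result(const message *m, int *answer)
{
    assert(m != NULL);
    if (m->type != MSG_RESULT)
        return 0;
    *answer = m->arg.answer;
    return 1;
}
```

Because the expression returns zero on a type mismatch, a handler can try several message types in turn, for example in an if/else chain, until one of them matches.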

Fig. 5 shows an example of receiving messages, sent in the example of Fig. 4. 
When a process message arrives (1), the receiver agent assigns the variable i to 
the value of argument n (2). Next, by a dereference expression, the message-type 
argument m is obtained and the value of argument dat is copied to d (3). And 
finally, by assigning a message object to the out moded argument r (4), the 
object is sent back to the sender of process message (5). 

4 Implementation 

Using Pthreads, the current implementation supports concurrent execution on a
single processor or parallel execution on shared-memory multiprocessors.

The implementation consists of an Orgel compiler called Ore and Orgel runtime
libraries, which support agent management and stream communication. Ore is
implemented as an Orgel-to-C translator. The generated C program is compiled
by a C compiler and linked with the Orgel runtime libraries to produce the
executable file.

4.1 Implementation of Agents 

For each agent type declaration shown in Fig. 2, Ore generates an agent main
function. On the creation of an agent, a thread is created and starts
execution of the corresponding agent main function.

In an agent main function, the initial handler is called first. Then, in the
main loop, message handlers are called according to the received message
types, and the task handler is called if no message is received. A terminate
statement is compiled into code that breaks out of the main loop. When the
main loop ends, the final handler is called before the thread terminates.

The agent member variables are translated into a C struct type that has each
variable as a member. The agent main function defines a variable of this
struct type, and a pointer to it is added to the arguments of member
functions. Each access to a member variable is replaced with an access to the
corresponding member of the struct. Thus, an instance of the member variables
is allocated for each agent instance, and can be accessed in any member
function.
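The shape of such an agent main function can be sketched in plain C. This is not output of the Orgel compiler: the agent type `counter`, its message IDs, the `inbox` array that stands in for a stream, and all function names are hypothetical. For easier inspection the member-variable struct is passed in by the caller, whereas the text says the agent main function defines it itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message IDs; MSG_NONE models a poll that found no message. */
enum { MSG_NONE = 0, MSG_ADD = 1, MSG_STOP = 2 };

typedef struct { int id; int value; } message;

/* Member variables of a hypothetical agent type, gathered in one struct. */
typedef struct {
    int sum;          /* updated by the MSG_ADD handler */
    int idle_ticks;   /* updated by the task handler */
    int finalized;    /* set by the final handler */
} counter_vars;

/* Stand-in for an input stream: a fixed array of messages read in order. */
typedef struct { const message *items; size_t len; size_t pos; } inbox;

static const message *inbox_poll(inbox *in)
{
    return in->pos < in->len ? &in->items[in->pos++] : NULL;
}

/* Handlers receive a pointer to the member-variable struct. */
static void counter_initial(counter_vars *self)
{
    self->sum = 0;
    self->idle_ticks = 0;
    self->finalized = 0;
}
static void counter_final(counter_vars *self) { self->finalized = 1; }
static void counter_task(counter_vars *self) { self->idle_ticks++; }
static void counter_on_add(counter_vars *self, const message *m)
{
    self->sum += m->value;
}

/* Agent main function: initial handler, dispatch loop, final handler. */
void counter_main(inbox *in, counter_vars *self)
{
    int running = 1;
    assert(in != NULL && self != NULL);
    counter_initial(self);
    while (running) {
        const message *m = inbox_poll(in);
        if (m == NULL)
            break;                      /* sketch only: input exhausted */
        switch (m->id) {
        case MSG_ADD:  counter_on_add(self, m); break;
        case MSG_STOP: running = 0; break;   /* role of `terminate` */
        default:       counter_task(self); break;
        }
    }
    counter_final(self);
}
```

Feeding the agent an add of 2, an empty poll, an add of 5 and a stop message leaves `sum` at 7 and `idle_ticks` at 1; anything queued after the stop message is never handled.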

To limit the number of threads for efficiency, and to enable lazy creation of
agents, an agent instance is represented by an agent record. Corresponding to
the logical creation of agents, agent records are created first. The Orgel
runtime then schedules agents using these records, creating threads as needed.

4.2 Implementation of Streams and Messages 

A stream instance is represented as a stream record, which keeps connection 
information and a message queue. 

Because the structure of an Orgel message is statically declared, it can be
compiled into a C struct type. Every message struct type has an ID field to
distinguish message types and a logical pointer to a message struct to form
the message queue of a stream. The C-data-type arguments of the message are
compiled into members of the struct, and the message-type arguments are
represented as logical pointers to the corresponding struct types.

Because of the message-type arguments, messages are not always freed in
creation order. They are therefore allocated on a global heap and managed
using garbage collection (GC). This heap has a 2-level structure: a message
variable contains an index into the heap entry table, and each entry holds the
address of the corresponding message in the heap. Thus, on GC, the runtime
system can compact messages without changing the values of message variables.
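The benefit of the two-level structure can be seen in a small model of a handle-based heap. This sketch makes several simplifying assumptions that are not from the paper: messages have a fixed size, the heap holds `HEAP_SLOTS` messages, and dead messages are marked by an explicit `heap_release` call in place of a real garbage collector. All identifiers are hypothetical.

```c
#include <assert.h>

#define HEAP_SLOTS 8

typedef int msg_ref;   /* logical pointer: an index into the entry table */

typedef struct {
    int id;            /* message type ID */
    msg_ref next;      /* link used to queue messages on a stream, -1 = none */
    int payload;       /* one C-data-type argument */
} msg;

typedef struct {
    msg slot[HEAP_SLOTS];    /* the heap proper */
    int owner[HEAP_SLOTS];   /* entry that refers to each slot, -1 = garbage */
    int entry[HEAP_SLOTS];   /* entry table: logical pointer -> slot index */
    int top;                 /* next unused slot */
} msg_heap;

void heap_init(msg_heap *h)
{
    for (int i = 0; i < HEAP_SLOTS; i++) {
        h->owner[i] = -1;
        h->entry[i] = -1;
    }
    h->top = 0;
}

/* Allocates a message and returns its logical pointer, or -1 when full. */
msg_ref heap_alloc(msg_heap *h, int id, int payload)
{
    if (h->top == HEAP_SLOTS)
        return -1;
    for (msg_ref r = 0; r < HEAP_SLOTS; r++) {
        if (h->entry[r] < 0) {
            int s = h->top++;
            h->slot[s] = (msg){ id, -1, payload };
            h->owner[s] = r;
            h->entry[r] = s;
            return r;
        }
    }
    return -1;
}

msg *heap_get(msg_heap *h, msg_ref r)
{
    assert(r >= 0 && r < HEAP_SLOTS && h->entry[r] >= 0);
    return &h->slot[h->entry[r]];
}

/* Stand-in for the collector deciding that a message is dead. */
void heap_release(msg_heap *h, msg_ref r)
{
    assert(r >= 0 && r < HEAP_SLOTS && h->entry[r] >= 0);
    h->owner[h->entry[r]] = -1;
    h->entry[r] = -1;
}

/* Slides live messages to the front; only the entry table is updated,
 * so logical pointers held in message variables stay valid. */
void heap_compact(msg_heap *h)
{
    int dst = 0;
    for (int src = 0; src < h->top; src++) {
        int r = h->owner[src];
        if (r < 0)
            continue;
        if (dst != src) {
            h->slot[dst] = h->slot[src];
            h->owner[dst] = r;
            h->owner[src] = -1;
        }
        h->entry[r] = dst;
        dst++;
    }
    h->top = dst;
}
```

After three allocations, releasing the second message and compacting moves the third message one slot down, yet its logical pointer is unchanged and still resolves to the same contents.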

4.3 Implementation of Sending/Receiving Messages 

Message operations in agent type declarations and member functions are
replaced by calls to the corresponding functions in the runtime library. The
once-assignment rule for message variables is enforced by compile-time and
runtime checks. For the latter, Ore inserts inline code to check the
restriction.

The suspension/resumption of agents is implemented as follows. The
dereference function checks whether the message variable is instantiated. If
it is uninstantiated, the function creates a hook record with a condition
variable [3] and suspends the thread. When a message object is assigned to the
variable, the inserted code finds the hook and sends a signal to resume the
suspended thread.
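The suspend-and-resume mechanism maps naturally onto a single-assignment variable guarded by a Pthreads mutex and condition variable. The sketch below illustrates that pattern only and is not the Orgel runtime: the type `logic_var` and its functions are hypothetical, the variable holds a plain int where the runtime would hold a message object, and the condition variable is embedded in the variable itself, where the runtime would create a separate hook record.

```c
#include <assert.h>
#include <pthread.h>

/* A single-assignment variable: unbound until it is assigned exactly once. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t bound;   /* plays the role of the hook's condition variable */
    int is_set;
    int value;
} logic_var;

void logic_var_init(logic_var *v)
{
    pthread_mutex_init(&v->lock, NULL);
    pthread_cond_init(&v->bound, NULL);
    v->is_set = 0;
    v->value = 0;
}

/* Assigns the variable and wakes every suspended reader.
 * Returns 0 on success, -1 if the once-assignment rule would be violated. */
int logic_var_assign(logic_var *v, int value)
{
    int rc = -1;
    pthread_mutex_lock(&v->lock);
    if (!v->is_set) {
        v->value = value;
        v->is_set = 1;
        rc = 0;
        pthread_cond_broadcast(&v->bound);
    }
    pthread_mutex_unlock(&v->lock);
    return rc;
}

/* Dereference: suspends the calling thread until the variable is assigned. */
int logic_var_deref(logic_var *v)
{
    pthread_mutex_lock(&v->lock);
    while (!v->is_set)
        pthread_cond_wait(&v->bound, &v->lock);
    int value = v->value;
    pthread_mutex_unlock(&v->lock);
    return value;
}

void logic_var_destroy(logic_var *v)
{
    pthread_cond_destroy(&v->bound);
    pthread_mutex_destroy(&v->lock);
}
```

In a threaded program one thread would call `logic_var_deref` and block, and another thread would later call `logic_var_assign` to resume it. The wait sits in a `while` loop because POSIX allows `pthread_cond_wait` to wake spuriously.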

5 Evaluation 

We evaluated the prototype implementation using 2 programs: nqueen and pia.
The latter is a multiple protein sequence alignment program using a parallel
iterative improvement method.

Table 1. Sequential Performance (On SS10 + Solaris 2.5)

(a)
                 C           Orgel       ratio
nqueen           66.54 s     73.88 s     1.11
pia              12.18 s     12.78 s     1.05

(b)
                 C+Pthreads  Orgel       ratio
switching        18.50 µs    22.58 µs    1.22
communication     2.96 µs    12.76 µs    4.31

Table 2. Parallel Performance (On SPARCcenter + Solaris 2.6)

                 sequential  parallel    speedup
nqueen           213.58 s    21.24 s     10.06
pia               50.24 s     7.72 s      6.50

5.1 Sequential Performance 

We evaluated the efficiency on a single processor by comparison with
sequential C programs and concurrent programs using the Pthreads library.

Table 1(a) shows the execution time of nqueen and pia. The C version performs
the computation equivalent to the Orgel agents sequentially in loops. The
result shows that the overhead of using Orgel is only 5-11% compared to C.

Table 1(b) shows overhead of Orgel runtime, compared with directly using 
Pthreads in C. We used a benchmark program that repeats transferring an in- 
teger value between 2 threads. The C-|-Pthreads version uses shared variable for 
integer transfer, and uses semaphore for synchronization. 

The thread switching overhead of Orgel is only 22% larger than C-|-Pthreads. 
To deal with many-to-many dependencies among threads, Orgel uses condition 
variables for synchronization. But its overhead is small enough. 

Even for transferring just an integer, a message must be created and sent 
via stream in Orgel. But the overhead is only 4 times larger than using a shared 
variable. We regard it is small enough because the ratio of transmission overhead 
is smaller in practical programs, and the overhead using Pthreads grows larger 
when buffering data or transferring dynamic data structures. Still more, the 
current overhead of Orgel communication includes that of locking/unlocking 
stream records and the heap, which can be reduced using static analysis. 

5.2 Parallel Performance 

We executed the Orgel versions of nqueen and pia on a shared-memory
multiprocessor machine, SPARCcenter (Solaris 2.6). The results are shown in
Table 2.

Nqueen obtained a speedup of 10.06 using 11 threads, which is almost
linear. Pia obtained a speedup of 6.50 using 13 threads. The communications
in pia are 1-to-many or many-to-1, and its performance should improve once
the mutual exclusion on message transmission is optimized.
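
As a quick reading aid, the reported speedups can be turned into parallel
efficiencies (speedup divided by the number of threads). The snippet below
is only a worked calculation on the figures quoted above; the dictionary
names are ours.

```python
# Reported speedup and thread count for each program (from Table 2 and the text).
reported = {"nqueen": (10.06, 11), "pia": (6.50, 13)}

# Parallel efficiency = speedup / number of threads.
efficiency = {
    name: speedup / threads
    for name, (speedup, threads) in reported.items()
}

for name, value in efficiency.items():
    print(f"{name}: efficiency {value:.2f}")
```

Under this measure nqueen runs at roughly 91% efficiency and pia at 50%,
which matches the qualitative reading given in the text.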

6 Conclusion

In this paper we proposed a new parallel programming language, Orgel, and
presented its design, implementation and evaluation.

Orgel is based on an execution model in which multiple agents, connected by
abstract message channels called streams, run in parallel. The distinctive
feature of Orgel is the declarative description of agent networks, which
prevents communication and synchronization bugs and also enables precise
static analysis for effective optimization. On the other hand, the
computation of each agent is described sequentially, which makes it
possible to write efficient programs.

The evaluation of the prototype implementation shows that the overhead on a
single processor is small compared to C or the Pthreads library, and that a
promising speedup is obtained on a multiprocessor machine.

We are currently working on an optimizer that uses static analysis.
Supporting distributed-memory multiprocessors is also future work.

Abstract. The BSλp calculus is a calculus of functional BSP programs
on enumerated parallel vectors. This confluent calculus is defined and
a parallel cost model is associated with a weak call-by-value strategy.
These results constitute the core of a formal design for a BSP dialect
of ML.

1 Introduction 

Some problems require performance that only massively parallel computers
offer, yet programming these machines is still difficult. Work on
functional programming and parallelism can be divided into two categories:
explicit parallel extensions of functional languages, where languages are
either non-deterministic or non-functional, and parallel implementations
with functional semantics, where the resulting languages do not express
parallel algorithms directly and do not allow the prediction of execution
times. Algorithmic skeleton languages [5,9], in which only a finite set of
operations (the skeletons) are parallel, constitute an intermediate
approach. Their functional semantics is explicit but their parallel
operational semantics is implicit. The set of algorithmic skeletons has to
be as complete as possible, but it often depends on the domain of
application.

We explore this intermediate position thoroughly in order to obtain
universal parallel languages where source code determines execution cost.
This last requirement forces the use of explicit processes corresponding to
the parallel machine's processors. A denotational approach led us to study
the expressiveness of functional parallel languages with explicit
processes [6], but it is not easily applicable to BSP [12] algorithms. An
operational approach has led to a BSP λ-calculus that is confluent and
universal for BSP algorithms [8], and to a library of Bulk Synchronous
primitives for the Objective Caml language which is sufficiently expressive
and allows the prediction of execution times [1].

Our goal is to provide a framework where programs can be proved correct,
can be given an a priori execution cost and can, of course, be implemented.
Note that we want a model simple enough to allow a programmer to predict
the performance of a program from its source code and to know where to
modify the source code to improve the execution time of the program (if
possible). However, although the BSλ calculus and the BSMLlib library share
the same BSP primitives, there is no formal connection between them. It is
thus possible to write a BSλ program and prove its correctness, then write
"the same" program with the BSMLlib library and use the BSP model to
predict its performance. But as there is no formal connection between them,
we are not guaranteed that the BSλ term and the BSMLlib program will have
the same behaviour, with regard to either correctness or performance. To
attain our goal we need a tower of formalisms where each level is correct
with respect to the next one. The ground floor will be an abstract machine
(a formal description of the implementation) and the top floor will be the
BSλ calculus. Chapter 5 of [7] presents two intermediate floors: the
BSλp-calculus and a distributed semantics (called the distributed
evaluation). Both are formalised as higher-order rewrite systems. In this
article we present the BSλp-calculus in a more classical (and readable)
manner and provide an associated cost model.

The main gap between the BSλ-calculus and the BSMLlib library is that in
the calculus, base parallel objects are expressed intensionally by a
function $f$ from processor names to values (the term $\pi f$ represents
the parallel vector where processor $i$ holds the value $f\,i$), the
network being potentially infinite, whereas the BSMLlib library runs on a
network with $p$ processors. The BSλp-calculus replaces the intensional
vectors $\pi f$ by enumerated ones $\langle e_0, \ldots, e_{p-1} \rangle$,
the enumeration being finite and of fixed width. As a result, the parallel
interpretation of reduction, as well as the cost of such a reduction, is
more naturally expressed than for the BSλ-calculus. Moreover, it is still
possible to express vectors in an intensional way, which is more convenient
for writing programs.

Section 2 presents the syntax and rules of the BSλp calculus. Section 3 is
devoted to a weak call-by-value strategy. The proof of confluence of this
strategy highlights the parallel interpretation of reduction, which is
given a parallel cost in Section 4.



2 The BSλp-Calculus

In this section we introduce an extension of the λ-calculus called the
BSλp-calculus. Its parallel data structures are flat and map directly to
physical processors. This difference from certain languages, although
apparently minor, is crucial: BSλp programs require no flattening
[3, ch. 10] and thus have complete control of the
computation/communication ratio.

The calculus introduces operations for data-parallel programming but with
explicit processes, in the spirit of BSP. We now describe the BSλp syntax
and its reduction with operational motivations. The reader is assumed to be
familiar with the elements of BSP [11].

Syntax We consider a set $\dot{\mathcal{V}}$ of local variables and a set
$\bar{\mathcal{V}}$ of global variables. From now on, $\dot{x}, \dot{y},
\ldots$ denote local variables and $\bar{x}, \bar{y}, \ldots$ denote global
variables; $x$ denotes a variable which can be either local or global.

The syntax of BSλp begins with local terms $t$: λ-terms representing
programs or values stored in a processor's local memory. The set
$\dot{\mathcal{T}}$ of local terms is given by the following grammar:

    $t ::= \dot{x} \mid t\,t \mid \lambda\dot{x}.\,t \mid c$

where $\dot{x}$ denotes an arbitrary local variable. We will abbreviate to
$(t_1 \to t_2, t_3)$ the conditional term $t_1\,t_2\,t_3$. We assume for
the sake of simplicity¹ a finite set $\mathcal{N} = \{0, \ldots, p-1\}$
which represents the set of processor names.

The principal BSλp terms $E$ are called global and represent parallel
vectors, i.e. tuples of $p$ local values where the $i$th value is located
at the processor with rank $i$. The notation is
$\langle t_0, \ldots, t_{p-1} \rangle$, where $t_0, \ldots, t_{p-1}$ are
local terms.

The set $\bar{\mathcal{T}}$ of global terms is given by the following
grammar²:

    $T ::= \bar{x} \mid T\,T \mid T\,t \mid \lambda\bar{x}.\,T \mid \lambda\dot{x}.\,T$
    $\quad \mid \langle t, \ldots, t, \ldots, t \rangle \mid T \# T \mid T \mathbin{?} T \mid (T \xrightarrow{t} T, T)$

and has the following denotational meaning.

Global terms denote parallel vectors (finite maps from $\mathcal{N}$ to
local values), functions between them ($\lambda\bar{x}.\,T$) or functions
from local values to such vectors ($\lambda\dot{x}.\,T$).

The forms $T_1 \# T_2$ and $T_1 \mathbin{?} T_2$ are called parallel
application (apply-par) and get respectively. Apply-par represents
point-wise application of a vector of functions to a vector of values, i.e.
the pure computation phase of a BSP superstep. Get represents the
communication phase of a BSP superstep: a collective data exchange with a
barrier synchronization. In $T_1 \mathbin{?} T_2$, the resulting vector
contains values from $T_1$ taken at the processor names defined in $T_2$
(Fig. 1, left).




Fig. 1. Get and global conditional 

The exact meanings of apply-par and get are defined by the BSλp rules
(Fig. 2). The last form of global terms defines synchronous conditional
expressions. The meaning of $(T_1 \xrightarrow{n} T_2, T_3)$ (not to be
confused with $(t_1 \to t_2, t_3)$) is that of $T_2$ (resp. $T_3$) if the
vector denoted by $T_1$ has value true (resp. false) at the processor name
denoted by $n$ (Fig. 1, right).

In the following we will identify terms modulo renaming of bound variables
and we will use Barendregt's variable convention [2]: if terms
$t_1, \ldots, t_n$ occur in a certain mathematical context then in these
terms all bound variables are chosen to be different from free variables.

¹ This set $\mathcal{N}$ can be defined in a general way as a set of closed
β-normal forms.
² Terms of the form $(\lambda\bar{x}.\,T)\,t$ (resp.
$(\lambda\dot{x}.\,T)\,T'$) are not β-contracted and constitute implicit
errors because they present a local argument to a global→global function
(resp. a global argument to a local→global function). In practice a
two-level type system should eliminate them, but we will not discuss this
here.

Rules We now define the reduction of BSλp terms.

The reduction of local terms is simply β-reduction, obtained from the local
β-contraction rule (1) with the usual context rules.

The reduction of global terms is defined by syntax-directed rules and
context rules which determine the applicability of the former. First, there
are rules for global β-equivalence, (2) and (3).

There are also axioms for the interaction of the vector constructor with
the other BSP operations, (4) and (5), where for all
$i \in \{0, \ldots, p-1\}$, $n_i$ is a processor name belonging to
$\mathcal{N}$. The value of $E_1 \mathbin{?} E_2$ at processor name $i$ is
the value of $E_1$ at the processor name given by the value of $E_2$ at
$i$. Notice that, in practical terms, this represents an operation whereby
every processor receives one and only one value from one and only one other
processor. This restriction can be lifted for the BSλ-calculus [7] and for
the BSλp-calculus.

Next, the global conditional is defined by two rules (6), where $n$ belongs
to $\mathcal{N}$ and $T$ is $T_1$ (resp. $T_2$) when $S$ is true (resp.
false). The two cases generate the following bulk-synchronous computation:
first a pure computation phase where all processors evaluate the local term
yielding $n$; then processor $n$ evaluates the parallel vector of booleans;
if at processor $n$ the value is true (resp. false) then processor $n$
broadcasts the order for global evaluation of $T_1$ (resp. $T_2$);
otherwise the computation fails. These two rules are necessary to express
algorithms of the form:

    Repeat Parallel Iteration Until Max of local errors < epsilon

because without them the global control cannot take into account data
computed locally, i.e. global control cannot depend on data.
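
The "repeat until" pattern can be sketched in ordinary Python, with a list
standing for a parallel vector. This is an illustration only: the function
iterate_until_converged and its parameters are hypothetical, and the
exchange of local errors is simulated by a plain max over the list.

```python
def iterate_until_converged(vector, local_step, local_error, epsilon,
                            n=0, max_rounds=1000):
    """Repeat a parallel iteration until the maximum local error is below epsilon.

    vector[i] models the value held by processor i; n is the processor whose
    boolean drives the global conditional.
    """
    p = len(vector)
    for _ in range(max_rounds):
        # Pure computation phase: every processor updates its own component.
        vector = [local_step(i, x) for i, x in enumerate(vector)]
        errors = [local_error(i, x) for i, x in enumerate(vector)]
        # After an exchange of the local errors, each processor holds a boolean.
        converged = [max(errors) < epsilon for _ in range(p)]
        # Global conditional: the boolean held at processor n decides for all.
        if converged[n]:
            return vector
    raise RuntimeError("no convergence within max_rounds")
```

The point of the sketch is the last test: the control flow of the whole
program depends on a value that was computed locally and then made global.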

Rules of figure 2 are applicable in any context. 



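    $(\lambda\dot{x}.\,t)\,t' \to t[\dot{x} \leftarrow t']$    (1)

    $(\lambda\bar{x}.\,T)\,T' \to T[\bar{x} \leftarrow T']$    (2)

    $(\lambda\dot{x}.\,T)\,t' \to T[\dot{x} \leftarrow t']$    (3)

    $\langle t_0, \ldots, t_{p-1} \rangle \# \langle u_0, \ldots, u_{p-1} \rangle \to \langle t_0\,u_0, \ldots, t_{p-1}\,u_{p-1} \rangle$    (4)

    $\langle t_0, \ldots, t_{p-1} \rangle \mathbin{?} \langle n_0, \ldots, n_{p-1} \rangle \to \langle t_{n_0}, \ldots, t_{n_{p-1}} \rangle$    (5)

    $(\langle t_0, \ldots, S, \ldots, t_{p-1} \rangle \xrightarrow{n} T_1, T_2) \to T$    (6)



Fig. 2. The BSλp calculus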

Examples The first example shows that intensionality is not lost. The
$\pi$, or parallel vector constructor, of BSλ can be defined as:

    $\pi = \lambda\dot{f}.\,\langle \dot{f}\,0, \ldots, \dot{f}\,(p-1) \rangle$

The second one is the direct broadcast algorithm, which broadcasts the
value held at processor 0:
$\mathit{bcast0} = \lambda\bar{x}.\,\bar{x} \mathbin{?} \pi(\lambda\dot{x}.\,0)$.
If applied to a vector it can be reduced as follows:

    $\mathit{bcast0}\,\langle t_0, \ldots, t_{p-1} \rangle$
    $\to \langle t_0, \ldots, t_{p-1} \rangle \mathbin{?} \pi(\lambda\dot{x}.\,0)$
    $\to \langle t_0, \ldots, t_{p-1} \rangle \mathbin{?} \langle (\lambda\dot{x}.\,0)\,0, \ldots, (\lambda\dot{x}.\,0)\,(p-1) \rangle$
    $\to^{*} \langle t_0, \ldots, t_{p-1} \rangle \mathbin{?} \langle 0, \ldots, 0 \rangle$
    $\to \langle t_0, \ldots, t_0 \rangle$
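
The operations used in these examples can be mimicked on Python lists,
which is a convenient way to experiment with them. The sketch below is not
the BSMLlib API: the function names are ours, a list of length p stands for
an enumerated vector, and Python functions stand for λ-terms.

```python
def pi(f, p):
    """Intensional constructor: the enumerated vector <f 0, ..., f (p-1)>."""
    return [f(i) for i in range(p)]


def apply_par(fs, vs):
    """Apply-par (#): point-wise application of functions to values, as in rule (4)."""
    assert len(fs) == len(vs)
    return [f(v) for f, v in zip(fs, vs)]


def get(ts, ns):
    """Get (?): processor i receives the value held by processor ns[i], as in rule (5)."""
    assert len(ts) == len(ns)
    return [ts[n] for n in ns]


def bcast0(ts):
    """Direct broadcast of the value held at processor 0."""
    return get(ts, pi(lambda i: 0, len(ts)))


print(bcast0(["a", "b", "c"]))
```

Calling bcast0 on a three-element vector follows the same steps as the
reduction above: the vector of zeros is built by pi, then get copies
component 0 everywhere.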

Confluence The BSλp-calculus is confluent. In [7], BSλp is presented as a
higher-order rewrite system, and its confluence follows from a general
theorem on the confluence of such systems. The calculus presented here and
the calculus expressed as a higher-order rewrite system are equivalent, so
confluence follows directly. By comparison, the confluence of BSλ required
a much longer proof.



3 Weak Call-by-Value Strategy 



For the BSλp-calculus it is possible to define two different reduction
strategies for the two levels of the calculus. We choose here the same
strategy for local and global reduction: the weak call-by-value strategy.
With such a strategy, encodings such as Church numerals are no longer
usable. The calculus has to be extended with new constants and rules to
deal at least with numbers and booleans. We will omit such constants for
the sake of clarity³.

The strategy can be roughly described as follows: (1) reduction is
impossible in the scope of a λ, and the last two arguments of a global
conditional cannot be reduced; (2) for application, apply-par and get, the
right argument is evaluated first, then the left one; for global
conditionals the local term is evaluated first, then the condition; (3) for
application, apply-par and get, the non-context rules can be applied only
if the arguments are in normal form; for global conditionals the rule can
be applied when the term representing the processor name and the condition
are in normal form.

More precisely we define the local values: constants and
$\lambda\dot{x}.\,t$; and the global values: $\lambda\bar{x}.\,T$,
$\lambda\dot{x}.\,T$ and $\langle v_0, \ldots, v_{p-1} \rangle$, where
$v_i$ is a local value for all $i \in \{0, \ldots, p-1\}$.
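
For local terms, the strategy described above (no reduction under a λ,
right argument first, β-contraction only on values) can be sketched as a
small interpreter. The tuple encoding of terms and the function names are
assumptions made for the illustration; global terms are not covered, and
the top-level term is assumed to be closed so that substitution needs no
renaming.

```python
# Local terms: ("var", name) | ("lam", name, body) | ("app", fun, arg) | ("const", c)

def is_value(t):
    """Local values are constants and abstractions."""
    return t[0] in ("lam", "const")


def substitute(t, name, value):
    """Replace the free occurrences of `name` in t by the closed value `value`."""
    kind = t[0]
    if kind == "var":
        return value if t[1] == name else t
    if kind == "lam":
        # An inner binder with the same name shadows `name`.
        return t if t[1] == name else ("lam", t[1], substitute(t[2], name, value))
    if kind == "app":
        return ("app", substitute(t[1], name, value), substitute(t[2], name, value))
    return t  # constants


def step(t):
    """One step of weak call-by-value reduction, or None if no rule applies."""
    if t[0] != "app":
        return None  # values and variables: nothing is reduced under a lambda
    fun, arg = t[1], t[2]
    if not is_value(arg):
        reduced = step(arg)  # context rule: the right argument is reduced first
        return None if reduced is None else ("app", fun, reduced)
    if not is_value(fun):
        reduced = step(fun)  # context rule: then the left one
        return None if reduced is None else ("app", reduced, arg)
    if fun[0] == "lam":
        return substitute(fun[2], fun[1], arg)  # beta-contraction on a value
    return None  # a constant applied to a value is stuck in this sketch


def evaluate(t):
    """Iterate step until no rule applies."""
    while True:
        nxt = step(t)
        if nxt is None:
            return t
        t = nxt
```

Because step either finds exactly one applicable rule or none, this
relation is a function on local terms.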

In the following set of rules, $v$, $v_0$, $\ldots$, $V$ are values and
$n$, $n_0$, $\ldots$ are processor names:



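(S1)   $(\lambda\dot{x}.\,t)\,v \Rightarrow t[\dot{x} \leftarrow v]$

(S2)   $\dfrac{t \Rightarrow t'}{u\,t \Rightarrow u\,t'}$

(S3)   $\dfrac{t \Rightarrow t'}{t\,v \Rightarrow t'\,v}$

(S4)   $\dfrac{t \Rightarrow t'}{\langle t_0, \ldots, t, \ldots, t_{p-1} \rangle \Rightarrow \langle t_0, \ldots, t', \ldots, t_{p-1} \rangle}$

(S5)   $(\lambda\bar{x}.\,T)\,V \Rightarrow T[\bar{x} \leftarrow V]$

(S6)   $(\lambda\dot{x}.\,T)\,v \Rightarrow T[\dot{x} \leftarrow v]$

(S7)   $\dfrac{T \Rightarrow T'}{U\,T \Rightarrow U\,T'}$

(S8)   $\dfrac{T \Rightarrow T'}{T\,V \Rightarrow T'\,V}$

(S9)   $\dfrac{t \Rightarrow t'}{T\,t \Rightarrow T\,t'}$

(S10)  $\dfrac{T \Rightarrow T'}{T\,v \Rightarrow T'\,v}$

(S11)  $\langle v_0, \ldots, v_{p-1} \rangle \# \langle v'_0, \ldots, v'_{p-1} \rangle \Rightarrow \langle v_0\,v'_0, \ldots, v_{p-1}\,v'_{p-1} \rangle$

(S12)  $\dfrac{T \Rightarrow T'}{U \# T \Rightarrow U \# T'}$

(S13)  $\dfrac{T \Rightarrow T'}{T \# V \Rightarrow T' \# V}$

(S14)  $\langle v_0, \ldots, v_{p-1} \rangle \mathbin{?} \langle n_0, \ldots, n_{p-1} \rangle \Rightarrow \langle v_{n_0}, \ldots, v_{n_{p-1}} \rangle$

(S15)  $\dfrac{T \Rightarrow T'}{U \mathbin{?} T \Rightarrow U \mathbin{?} T'}$

(S16)  $\dfrac{T \Rightarrow T'}{T \mathbin{?} V \Rightarrow T' \mathbin{?} V}$

(S17)  $(\langle w_0, \ldots, \mathit{true}, \ldots, w_{p-1} \rangle \xrightarrow{n} T_1, T_2) \Rightarrow T_1$

(S18)  $(\langle w_0, \ldots, \mathit{false}, \ldots, w_{p-1} \rangle \xrightarrow{n} T_1, T_2) \Rightarrow T_2$

(S19)  $\dfrac{t \Rightarrow t'}{(T \xrightarrow{t} T_1, T_2) \Rightarrow (T \xrightarrow{t'} T_1, T_2)}$

(S20)  $\dfrac{T \Rightarrow T'}{(T \xrightarrow{n} T_1, T_2) \Rightarrow (T' \xrightarrow{n} T_1, T_2)}$

³ Using the formulation of BSλp as a higher-order rewrite system, it is
very simple to verify that the calculus is still confluent.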



Confluence On local terms, $\Rightarrow$ is a function. This property is
easily proved by induction on the rules (S1), (S3) and (S2).

Lemma 1. On global terms $\Rightarrow$ is strongly confluent: if
$T \Rightarrow T_1$ and $T \Rightarrow T_2$ then there exists $T_3$ such
that $T_1 \Rightarrow T_3$ and $T_2 \Rightarrow T_3$.

Sketch of proof: By induction on $T \Rightarrow T_1$. Each operation
excludes the others. For example, if $T \Rightarrow T_1$ is (S11) then
$T \Rightarrow T_2$ can only be (S11), (S13) or (S12). But the arguments of
# in $T$ are values, otherwise the rule (S11) could not be applied: this
excludes rules (S13) and (S12). Moreover the hypothesis must be the same,
so $T_2 = T_1$.

This is similar for all operations: $\Rightarrow$ is also a function on
global terms, except on parallel vectors. If $T \Rightarrow T_1$ is (S4),
for example at processor $i$, then $T \Rightarrow T_2$ can be (S4) but at a
different processor $j$. In this case, applying (S4) at processor $j$ on
$T_1$ and (S4) at processor $i$ on $T_2$ will lead to the same term
$T_3$. □

This proof highlights the fact that $\Rightarrow$ is non-deterministic only
for parallel vectors. The non-determinism of $\Rightarrow$ corresponds to
the asynchronous computation phase of the BSP model. Some operations like #
also belong to such phases, but this feature is not well captured by the
$\Rightarrow$ strategy. That is why a distributed evaluation has been
designed [7]. It is non-deterministic for all operations but those which
correspond to the communication and synchronization phases, namely the get
and the global conditional. Nevertheless, the $\Rightarrow$ strategy is
necessary to prove the properties of the distributed evaluation, and each
one offers a different point of view. BSλp corresponds to the macroscopic
view of data-parallel programs while the distributed semantics corresponds
to the microscopic view, or implementation.



4 Parallel Reduction and Parallel Cost Model 

Any reduction within the above system corresponds to a BSP computation as
follows. A term $\langle t_0, \ldots, t_{p-1} \rangle$ is implemented by
storing term $t_i$ on processor $i$ for $i = 0, \ldots, p-1$. The parts of
global terms that lie outside $\langle \ldots \rangle$ are replicated on
every processor and, by design of the BSλp rules, vector constructors
$\langle \ldots \rangle$ are never nested. As a result, we can associate a
parallel cost with reductions. The cost is a vector of $p$ numbers, the
overall cost being the maximum of the local costs.

1. A global application of (S5) or (S6), or an application of (S1) outside
   the vector constructor, is applied by every processor to its local
   terms. This counts for one local operation on every processor:

   $\langle c_0, \ldots, c_i, \ldots, c_{p-1} \rangle \mapsto \langle c_0 + 1, \ldots, c_i + 1, \ldots, c_{p-1} + 1 \rangle$

2. An application of (S1) to one of the $p$ terms in a vector construction
   is local to one processor:

   $\langle c_0, \ldots, c_i, \ldots, c_{p-1} \rangle \mapsto \langle c_0, \ldots, c_i + 1, \ldots, c_{p-1} \rangle$

3. An application of (S11) involves one local operation on each processor:

   $\langle c_0, \ldots, c_i, \ldots, c_{p-1} \rangle \mapsto \langle c_0 + 1, \ldots, c_i + 1, \ldots, c_{p-1} + 1 \rangle$

4. An application of (S14) to a term
   $\langle t_0, \ldots, t_{p-1} \rangle \mathbin{?} \langle n_0, \ldots, n_{p-1} \rangle$
   generates: (1) the request for data: the communication cost is
   $g \cdot h$, where $g$ is the parallel architecture's BSP parameter and
   $h$ is the highest frequency of any integer within
   $\{n_0, \ldots, n_{p-1}\}$ (a message here contains one integer, so its
   length is 1); (2) a synchronization barrier: the cost is $L$, the
   parallel architecture's BSP parameter; (3) the reception of data: the
   cost is $g \cdot h'$, where
   $h' = \max_{0 \le i < p} \{ s_i \cdot \#\{ j \mid n_j = i \} \}$, $s_i$
   is the size of the local term $t_i$ and $\#S$ is the cardinality of $S$;
   (4) another synchronization barrier. This barrier could be suppressed,
   because each processor requested a known number of data items from other
   processors, so it is sufficient to count the number of incoming messages
   to know whether the superstep is completed or not. The Oxford BSPlib
   does not offer such a zero-cost barrier, so the current implementation
   uses 2 synchronization barriers. The cost of this step is thus either 0
   or $L$. The BSP time required for an application of (S14) is
   $g \cdot h + g \cdot h' + L^{?}$, where $L^{?} = L$ or $2L$ and $g$, $L$
   are the parallel architecture's BSP parameters:

   $\langle c_0, \ldots, c_i, \ldots, c_{p-1} \rangle \mapsto \langle \ldots, \max_{0 \le i < p} c_i + g \cdot h + g \cdot h' + L^{?}, \ldots \rangle$

5. An application of (S17) or (S18) involves the broadcast of the value
   (true or false) to every other processor, followed by a local operation
   to realize the branching of control towards $T_1$ or $T_2$. The BSP time
   associated with this application is therefore
   $g \cdot (p - 1) + L + 1$, where $g$, $L$ are the parallel
   architecture's BSP parameters:

   $\langle c_0, \ldots, c_i, \ldots, c_{p-1} \rangle \mapsto \langle \ldots, \max_{0 \le i < p} c_i + g \cdot (p - 1) + L + 1, \ldots \rangle$
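
The five cases above translate directly into functions on cost vectors. The
sketch below is one possible reading, not code from the paper: function and
parameter names are ours, the request count for processor i is taken to be
the number of processors j with n_j = i, and the barriers argument selects
between one and two barriers in case 4.

```python
from collections import Counter


def cost_local_everywhere(costs):
    """Cases 1 and 3: one local operation on every processor."""
    return [c + 1 for c in costs]


def cost_local_at(costs, i):
    """Case 2: one local operation on processor i only."""
    return [c + 1 if j == i else c for j, c in enumerate(costs)]


def cost_get(costs, ns, sizes, g, L, barriers=2):
    """Case 4: get with request vector ns; sizes[i] is the size of term t_i."""
    requests = Counter(ns)  # requests[i] = number of processors asking processor i
    h = max(requests.values())
    h_prime = max(sizes[i] * requests[i] for i in range(len(costs)))
    total = max(costs) + g * h + g * h_prime + barriers * L
    return [total] * len(costs)


def cost_global_if(costs, g, L):
    """Case 5: broadcast of one boolean, a barrier and one local operation."""
    total = max(costs) + g * (len(costs) - 1) + L + 1
    return [total] * len(costs)
```

For a broadcast from processor 0 on p processors, ns is a vector of p
zeros, so h = p and h' = p times the size of t_0, which shows why a direct
broadcast is expensive when p is large.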

Following the above remarks it is possible to associate a BSP cost estimate
with any reduction. We will illustrate the strategy and the cost model on a
small example:

where $s_0$ is the size of the local term $t_0$.

It is very important to notice that such costing of reductions would be
impossible without the data structure $\langle \ldots \rangle$ of BSλp and
its parallel interpretation. The reason is that standard
recursively-defined types, like lists, do not refer to an explicit notion
of process. When applying parallel evaluation strategies to lists there is,
from the point of view of BSλp, excessive freedom in selecting redexes and
parallelizing reductions. As a result, the exact parallel meaning (local on
processor 0, local on processor 1, global, etc.) of a reduction is not
fixed by the theory and therefore depends on the syntactic context of its
application (at the beginning of a list, in the middle of it, etc.).

5 Conclusions and Future Work 

We have designed a calculus of BSP functional programs, a weak
call-by-value strategy for this calculus and an associated parallel cost
model. The formalism has the same advantages as our previous work BSλ [8],
but the parallel interpretation and the cost model are far easier to
express. Moreover, its distributed evaluation [7] realizes the
correspondence between the programming model (macroscopic) and the
execution model (microscopic) of data-parallel programs [4]. The
distributed evaluation has been proved correct w.r.t. the weak
call-by-value strategy [7], but it remains to give a cost model to the
distributed semantics and to prove its equivalence to the cost model of the
weak call-by-value strategy.

The last formal step will be a distributed abstract machine that will be
correct w.r.t. the distributed semantics. We will then obtain a complete
formal basis for the design of a complete programming environment
containing: a polymorphic, strongly typed parallel functional language
(Bulk Synchronous ML), tools for performance prediction, tools to help
prove program correctness and tools to derive programs, the derivation
being driven by the costs. Such an environment will be particularly well
suited to implementing skeleton libraries based on BSP [10] or complex BSP
algorithms.

Abstract. There are two classes of dataflow schemata, DF and ADF.
ADF is known to be equivalent to EF, but DF is not. ADF is obtained
from DF by adding two devices. One is recursion and the other is an
arbiter, which allows timing-dependent processing. We are interested in
whether both devices are necessary for ADF to have such powerful
expression ability. In this paper, we present relations between some
classes of dataflow schemata, and find that classes without recursion
but with timing-dependent processing are not as powerful as ADF.



1 Introduction 

A programming language of high expression ability is necessary to extract
maximum power from high performance computers. Currently, the expression
ability of a programming language is examined by comparing the classes of
functionals that can be realized by classes of program schemata [1].

The class EF of the effective functionals [2] is known to have the greatest
expression ability. Jaffe [3] investigated the ability of the dataflow
schemata class for the first time, and showed the class DF to be equivalent
to the class EF in the total interpretations.

However, Matsubara and Noguchi [4] have shown that DF is equivalent to the
class $EF^d$ of the deterministic effective functionals in the partial
interpretations.

After that, Matsubara and Noguchi [5] proposed the class ADF of dataflow
schemata, and showed that the class is equal to EF in the partial
interpretations. ADF is DF strengthened with two devices: one is an arbiter
and the other is recursion. The arbiter is used to make actions depend on
timing. The question arises whether both of these devices are required for
ADF to acquire such expression ability, or whether just one suffices.

To answer this question, it is necessary to examine the expression ability
of a class having just one kind of device.

So far, the class RDF, in which only recursion is usable, has been
investigated, and it was found that RDF(1) and RDF(∞) are also equivalent
to $EF^d$ [6]. Further, it was also found that ADF(∞) = EF [7]. Here, '(1)'
means that each arc holds at most one token, and '(∞)' means that arcs work
as FIFO queues.

In this paper, we are concerned with the classes for which timing
dependency is given and no recursion is included.

2 Program Schemata 

In this paper, function symbols are taken from the set
$\mathcal{F} = \{F_1, F_2, \ldots\}$ and predicate symbols from the set
$\mathcal{P} = \{P_1, P_2, \ldots\}$. For an element $e$ of $\mathcal{F}$
or $\mathcal{P}$, the number of arguments of $e$ is written $R_e$. For the
sake of simplicity, let $R_{F_i} \ge 1$ and $R_{P_i} \ge 1$ for every
$i \ge 1$. Variables are taken from the set
$\mathcal{X} = \{X_1, X_2, \ldots\}$.

A schema becomes a concrete program when an interpretation is given. The
program yields at most one result value by executing a concrete calculation
when input values are provided. An interpretation for a schema $S$ gives
the data domain $D$ and gives maps for the function symbols and predicate
symbols.

A partial interpretation is an interpretation whose maps are not
necessarily totally defined. Hereafter, interpretation means a partial one.

We now define the 'computation' of which an effective functional [2] is
composed.

Definition 2.1. (a) Let $X \subset \mathcal{X}$ be a finite set of input
variables, let $F \subset \mathcal{F}$ be a finite set of function symbols,
and let $P \subset \mathcal{P}$ be a finite set of predicate symbols. Let
$A$ and $\Pi$ be the minimum sets satisfying the following conditions.

I. $X \subset A$.

II. For each $f \in F$ and $e_1, \ldots, e_{R_f} \in A$,
$f(e_1, \ldots, e_{R_f}) \in A$.

III. For each $p \in P$ and $e_1, \ldots, e_{R_p} \in A$,
$p(e_1, \ldots, e_{R_p}) \in \Pi$ and
$\neg p(e_1, \ldots, e_{R_p}) \in \Pi$, where $\neg$ is the symbol
expressing negation.

The elements of $A$ are called expressions concerning $X$ and $F$, whereas
the elements of $\Pi$ are called propositions concerning $X$, $F$ and $P$.

(b) A computation is a sequence composed of expressions and propositions
that terminates with an expression.

Assume an interpretation $I$ is given and elements of $D$ are given to the
individual input variables. If the expressions and propositions composing
the computation are all defined and the propositions all have the value
true, the computation has a value. This value is that of the final
expression.

Definition 2.2. An effective functional $S$, an element of the class EF, is
defined by the 4-tuple $S = \langle X, F, P, T \rangle$. It is a
recursively enumerable set of computations. Here,

(1) $X \subset \mathcal{X}$, $F \subset \mathcal{F}$ and
$P \subset \mathcal{P}$ are finite sets of input variables, function
symbols and predicate symbols.

(2) $T$ denotes a Turing machine that, given a positive integer $i$ as
input, outputs the $i$th computation $T(i)$ under a proper coding.

(3) If, for a given interpretation and input values, two or more
computations, e.g. $T(i)$ and $T(j)$, have values, these values are assumed
to be the same. This value is the value of $S$.
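
The notion of the value of a computation under a partial interpretation can
be made concrete with a short sketch. The encoding is an assumption made
for the illustration: expressions are variable names or tuples, a partial
map returns None where it is undefined, and data values themselves are
assumed never to be None.

```python
# Expression: a variable name (str) or (function_symbol, arg_1, ..., arg_k).
# Proposition: (predicate_symbol, negated, arg_1, ..., arg_k).
# Computation: a list of ("expr", e) and ("prop", q) items ending with an "expr".

def eval_expr(expr, funcs, inputs):
    """Value of an expression, or None if it is undefined."""
    if isinstance(expr, str):
        return inputs[expr]
    symbol, *args = expr
    values = [eval_expr(a, funcs, inputs) for a in args]
    if any(v is None for v in values):
        return None
    return funcs[symbol](*values)  # a partial map may itself return None


def eval_prop(prop, funcs, preds, inputs):
    """Truth value of a proposition, or None if it is undefined."""
    symbol, negated, *args = prop
    values = [eval_expr(a, funcs, inputs) for a in args]
    if any(v is None for v in values):
        return None
    truth = preds[symbol](*values)
    if truth is None:
        return None
    return (not truth) if negated else truth


def computation_value(items, funcs, preds, inputs):
    """Value of the final expression if everything is defined and every
    proposition is true; None otherwise."""
    value = None
    for kind, body in items:
        if kind == "prop":
            if eval_prop(body, funcs, preds, inputs) is not True:
                return None
        else:
            value = eval_expr(body, funcs, inputs)
            if value is None:
                return None
    return value
```

An effective functional would then enumerate such computations T(1), T(2),
... and return the value of any one that has a value.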

As a sub-class of EF, we define the class of the deterministic effective
functionals.

Definition 2.3. The class $EF^d$ of the deterministic effective functionals
is composed of the elements $S = \langle X, F, P, T \rangle$ of EF having
the following property.

When an interpretation and input are given and $i$ is the smallest number
such that $T(i)$ has a value, then for every $k < i$, all the propositions
and expressions in $T(k)$ are defined up to the first proposition with the
value false, assuming the propositions and expressions are examined
successively from left to right.

Here, we introduce the class P of ordinary program schemata [1], where
simple variables can be used. The executable statements used in P are the
assignment statement, halt statement, goto statement and conditional
statement. For a detailed description see [8].

When an undefined value of a predicate or a function is to be evaluated
during execution, the execution is considered to have entered an infinite
loop, and no result is output.

Besides the class P, we must introduce the class Pc. In this class, in
addition to normal variables, a control variable, which holds at most one
truth value, can be used.

3 Dataflow Schemata 

A dataflow schema is a kind of program schema, but it differs from an
ordinary program schema in that execution proceeds in parallel, in
accordance with the flow of the data.

In a class of dataflow schemata, it is necessary to distinguish the case
where an arc can hold at most one token from the case where an arc can hold
an arbitrary number of tokens. The two cases are syntactically identical.
When an arc holds an arbitrary number of tokens, the order of the tokens is
maintained; that is to say, the arc acts as a FIFO queue.

Definition 3.1. The only kinds of nodes included in a graph belonging to
the class $df_{arb}$ are those depicted in Fig. 1.

A white arc is a control arc, a black one is a data arc, and a mixed one is
either of these. The input arcs and the output arc of the schema are
depicted explicitly, as in Fig. 1(o),(p).

In the description below, figures and statements are used interchangeably
as required, on the assumption that they are equivalent to each other.

To define the semantics of the graph, it is necessary to designate $p = 1$
or $\infty$, where $p = 1$ means that each arc can hold at most one token,
and $p = \infty$ means that each arc can hold an arbitrary number of
tokens. We attach $(p)$ after the class name.

[Figure 1: node diagrams not reproduced. Panels recoverable from the
source: (d) control T-gate, $y^c = T_c(v^c, x^c)$; (l) or,
$y^c = x_1^c \lor x_2^c$; (m) not; (n) arbiter, $y = Arb(x_1, x_2)$;
(o) input, $x_j$ = Input; (p) output, Output = $y$.]

Fig. 1. Dataflow diagram and its statement



Definition 3.2. Semantics of $df_{arb}(1)$: a schema of $df_{arb}(1)$ is driven into
action as shown below when an interpretation $I$ is given and input data are provided
to the individual input variables.

Let it be assumed that execution takes place in discrete time $t = 0, 1, 2, \ldots$.
If $u$ is a predicate symbol or function symbol and $d_1, \ldots, d_{R_u} \in D$, then
$\tau(u, d_1, \ldots, d_{R_u})$ gives the evaluation time of $u(d_1, \ldots, d_{R_u})$.
A time function $\tau$ that gives an infinite evaluation time when the value of
$u(d_1, \ldots, d_{R_u})$ is undefined, and a positive integer evaluation time when
the value is defined in the interpretation $I$, is called a time function consistent
with the interpretation $I$.

The action of the individual statements is explained below.

$y = u(x_1, \ldots, x_{R_u})$, where $u$ is a function symbol or predicate symbol:

When tokens $d_1, \ldots, d_{R_u}$ are placed on the individual input arcs and
no token is placed on $y$ at time $t = \tau_0$, then at
$t = \tau_0 + \tau(u, d_1, \ldots, d_{R_u})$ the tokens on the individual input arcs
are removed and the token of the evaluation result is placed on the output arc $y$.

$y = Arb(x_1, x_2)$:

If a token is placed on $x_1$ and no token is placed on $y$, then the token on $x_1$
is transferred onto $y$.

On the other hand, when $x_1$ is empty, a token is placed on $x_2$, and
$y$ is empty, then the token on $x_2$ is transferred to $y$.
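The arbiter rule can be illustrated with a small Python sketch. It assumes the reading that the first input arc has priority when both inputs hold a token, that each arc holds at most one token, and that `None` stands for an empty arc; the function name `arbiter_step` is hypothetical and not taken from the paper.

```python
from typing import Optional, Tuple

Token = Optional[str]  # None models an empty arc (assumption of this sketch)


def arbiter_step(x1: Token, x2: Token, y: Token) -> Tuple[Token, Token, Token]:
    """Apply one firing attempt of y = Arb(x1, x2) on capacity-one arcs.

    Returns the new contents of (x1, x2, y).
    """
    if y is not None:
        # The output arc is occupied, so the arbiter cannot act.
        return x1, x2, y
    if x1 is not None:
        # A token on the first input is transferred to the output.
        return None, x2, x1
    if x2 is not None:
        # The first input is empty, so the second input is served.
        return x1, None, x2
    # Both inputs are empty: nothing happens.
    return x1, x2, y
```

Which input is served therefore depends on which token arrives first, and this is the source of the timing dependency discussed in the text.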

The other nodes take only unit time to execute. For lack of space, the detailed
description is omitted; please refer to [8].

The value of the first token placed on the output arc of the schema is taken
as the output result of the schema.

The semantics of $df_{arb}(\infty)$ are the same as those of $df_{arb}(1)$, except that
each action is performed regardless of whether there is a token on the output arc(s).

Definition 3.3. For $p = 1$ or $\infty$, $DF_{arb}(p)$ is the class of those schemata
of the class $df_{arb}(p)$ that satisfy the condition shown below.

Condition: When an interpretation $I$ and input values are given, the same result
is produced for every time function $\tau$ consistent with $I$.

Here, some related classes of dataflow schemata are defined. 

Definition 3.4. For $p = 1, \infty$, the class $DF(p)$ consists of the
elements of $DF_{arb}(p)$ that include no arbiters.

Definition 3.5. A Boolean graph is an acyclic graph comprised exclusively of 
logical operators and control links of Fig. 1. 

For an arbitrary logical function, a Boolean graph realizing the function can 
easily be composed. As is shown in [4, 5], any kind of finite state machine can be 
composed by connecting the output arcs to the input arcs of a Boolean graph 
realizing an appropriate logical function. 

4 $P \equiv DF(1)$

In this section, we examine the ability of $DF(1)$. First, we show that the
relation $DF(1) \le P_c$ holds.

Because a schema $S$ of $DF(1)$ does not contain any element dependent on
timing, a sequential routine can be written not only to check the necessary and
sufficient condition for a token to be output on the output arc, but also to
evaluate the value of the token [4, 6].

As an example, we consider the cases of the T-gate $y = T(v^c, x)$.

Case A (when a token on $y$ is demanded):

(a) A token on $v^c$ is obtained.
(b) A token on $x$ is obtained.
(c) If the token on $v^c$ is ‘T’ then go to (e).
(d) Tokens on both input arcs are removed and then go to (a).
(e) The token on $v^c$ is removed and the token on $x$ is transferred to $y$.
(f) Return.

Case B (when $v^c$ is demanded to be empty):

(a) $y$ is demanded to be empty.
(b) A token on $x$ is obtained.
(c) If the token on $v^c$ is ‘T’ then go to (e).
(d) Tokens on both input arcs are removed and then go to (f).
(e) The token on $v^c$ is removed and the token on $x$ is transferred to $y$.
(f) Return.

Case C (when $x$ is demanded to be empty):

(a) $y$ is demanded to be empty.
(b) A token on $v^c$ is obtained.
(c) If the token on $v^c$ is ‘T’ then go to (e).
(d) Tokens on both input arcs are removed and then go to (f).
(e) The token on $v^c$ is removed and the token on $x$ is transferred to $y$.
(f) Return.

Other cases can be derived by analogy with the above examples. These routines
call each other as necessary. While executing a routine, it is possible to
arrive again at a routine for the same statement. As an example, in a routine
to output a token from a statement $y = T(v^c, x)$, the routine calls another
routine to obtain the token on $v^c$ if no token is there. While calling the other
routine, it is possible for the statement to be demanded to make $x$ empty. To
meet the demand it is necessary to obtain the token on $v^c$. This is a contradiction,
and so no token should be output in the original schema $S$. The
routines enter an infinite loop and so no result is output. Thus the action
of $S$ can be simulated.
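Case A of the T-gate routine can be sketched as a sequential Python function. This is only an illustration under stated assumptions: the two deques stand in for the routines that obtain tokens on the control arc and the data arc, an exhausted deque stands for a token that can never be obtained (where the routine described above would loop forever), and the name `t_gate_demand` is hypothetical.

```python
from collections import deque


def t_gate_demand(v_tokens: deque, x_tokens: deque):
    """Demand a token on y for y = T(v, x), following steps (a) to (f) of Case A.

    Returns the value transferred to y, or None when an input token can never
    be obtained (a stand-in for the non-terminating case).
    """
    while True:
        if not v_tokens or not x_tokens:
            return None
        v = v_tokens[0]                # (a) a token on v is obtained
        if v == "T":                   # (c) test the control token
            v_tokens.popleft()         # (e) remove the token on v ...
            return x_tokens.popleft()  # ... and transfer the token on x to y
        # (d) remove the tokens on both input arcs and go back to (a)
        v_tokens.popleft()
        x_tokens.popleft()
```

For example, with control tokens F, T and data tokens 1, 2, the first pair is discarded and the value 2 is passed to the output.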

Theorem 4.1. ($DF(1) \le P_c$): For an arbitrary schema $S$ of $DF(1)$, we can
compose a schema $S'$ of $P_c$ which is equivalent to $S$.

Proof. The simulation above can be implemented by the schema $S'$. Each
arc can be simulated by a pair of variables, one of which indicates whether there
is a token or not while the other holds the value. Since $S$ includes only
a finite number of statements, we can prepare a routine for each case and each
statement. The return from each routine can be realized by a simple GOTO
statement, since the point to which to return is definite for each routine.

The first statement should be the one calling the routine which obtains the
token on the output arc of $S$. Therefore, $S'$ can simulate $S$. □

Next, we show that the relation $P_c \le P$ holds.

Theorem 4.2. ($P_c \le P$): For each schema $S$ of $P_c$, we can compose an
equivalent schema of $P$.

Proof. For an arbitrary control variable $u$ in the schema $S$, we can construct an
equivalent schema $S'$ not including the variable $u$. We prepare a pair of
statements for each statement in $S$ according to the truth value of $u$. As an
example, for the statement ‘w ← u · v’, two statements are given. On one hand,
for the value ‘T’ of $u$, ‘w ← v’ is given. On the other hand, for the value ‘F’ of
$u$, ‘w ← F’ is given.

The statement ‘IF u THEN statement_1 ELSE statement_2’ is divided into the
two statements ‘statement_1’ and ‘statement_2’ in accordance with the value of $u$.

We can repeat this modification until the schema includes no control variable. □



Now, we show that $P \le DF(1)$ holds.

Theorem 4.3. ($P \le DF(1)$): For an arbitrary schema $S$ of $P$, we can construct
an equivalent schema $S'$ of $DF(1)$.

Proof. $S$ includes only a finite number of variables and a finite
number of statements. Therefore, it is not difficult to simulate the sequential
behavior of $S$ by $S'$, which can act in parallel. In fact, $S'$ can be
constructed in a similar but easier manner than the procedure COMPUTE_1
in [5]. $S'$ includes a finite state machine which controls the evaluating part.
The evaluating part includes latches which hold the value of each variable, a
function evaluator, a predicate evaluator and a transfer network which transfers
data tokens between the components. The finite state machine simulates the
action of each statement sequentially by controlling the evaluating
part. When a HALT statement is simulated, the value is output on the output
arc of $S'$. □

Based on the three theorems above, we obtain the relation $P \equiv DF(1)$.

Theorem 4.4. ($P \equiv DF(1)$): $DF(1)$ and $P$ are equivalent to each other.

Proof. From the three theorems above, we can conclude that
$P \le DF(1) \le P_c \le P$. Therefore the relation holds. □

Corollary 4.1. ($DF(\infty) > DF(1)$): $DF(\infty)$ properly includes $DF(1)$.

Proof. It is known that $P < EF^d \equiv DF(\infty)$ [4]. Therefore, the result
follows from Theorem 4.4. □

5 $DF_{arb}(1) > P$

Since $DF_{arb}(1)$ can use an arbiter, a device which cannot be used by $DF(1)$, it
is evident that $DF_{arb}(1) \ge DF(1)$. Therefore, it is easy to see that
$DF_{arb}(1) \ge P$ by the relation $P \equiv DF(1)$. In this section, we consider
whether the inclusion is proper.

Here, we introduce an example which can be used for the test. Example A
is shown in Fig. 2. When both predicates $P_1(x)$ and $P_2(x)$ have the value
T, the value of the output token is $x$, regardless of which predicate outputs its
result first. Therefore, the example is an element of the class $DF_{arb}(1)$.

[Figure 2: dataflow diagram of Example A.]

Fig. 2. Example A



Lemma 5.1. Example A cannot be simulated by a schema of $EF^d$.

Proof. Assume that a schema $S$ equivalent to Example A belongs to the
class $EF^d$. $S$ should have the same sets of input variables, predicate symbols
and function symbols, and the same output variable as Example A.

We introduce three interpretations $I_1$, $I_2$ and $I_3$ which have the same data
domain $D = \{a\}$.

On $I_1$, $P_1(a) = T$ and $P_2(a)$ is undefined.

On $I_2$, $P_1(a)$ is undefined and $P_2(a) = T$.

On $I_3$, both $P_1(a)$ and $P_2(a)$ are undefined.

On $I_1$, the first computation in $S$ which has a value is assumed to be the
$i_1$-th computation. On $I_2$, the $i_2$-th.

First, we assume that $i_1$ equals $i_2$. This computation should not include
$P_1(x)$ or $P_2(x)$. The reason is as follows: if $P_1(x)$ is included, then the
computation is undefined on $I_2$. If $P_2(x)$ is included, then the computation is
undefined on $I_1$.

If these predicates are not included, the computation should have a value on
$I_3$. This contradicts the assumption that $S$ is equivalent to Example A.
Therefore $i_1$ should not equal $i_2$.

Next, we assume that $i_1 < i_2$. On $I_2$, the $i_1$-th computation should not have
a value. Therefore, it should include $\neg P_2(x)$. This means that the computation
becomes undefined on $I_1$. When $i_2 < i_1$ is assumed, a similar contradiction is
deduced. □

Therefore, we can conclude the next theorem. 

Theorem 5.1. ($DF_{arb}(1) > P$): The class $DF_{arb}(1)$ properly includes the class $P$.

Proof. It is evident that Example A cannot be simulated by a schema of $P$, by
Lemma 5.1 and the relation $EF^d > P$. Therefore the result follows from
$DF_{arb}(1) \ge P$. □

6 $EF > DF_{arb}(1)$

It is easy to show that the relation $EF \ge DF_{arb}(1)$ holds. To reveal whether
the inclusion is a proper one, we introduce an Example B.

Example B: This is a schema of $EF^d$ including $x$ as the input variable, $L$
and $R$ as the function symbols and $P$ as the predicate symbol. The computations
enumerated by the schema are as follows:

1st: < P(x), x >
2nd: < PL(x), x >
3rd: < PR(x), x >
4th: < PLL(x), x >
5th: < PLR(x), x >
6th: < PRL(x), x >
...

Here, we use an abbreviation for the propositions. As an example, PRL(x) is
the abbreviation of P(R(L(x))).

This schema evaluates the computations in a fixed order, and if it finds a
computation having a value, then that value is output as the result of the
schema. While evaluating a computation, if the value of a proposition or an
expression becomes undefined, then the schema's value also becomes undefined.
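The enumeration order of Example B (P, PL, PR, PLL, PLR, PRL, ...) can be reproduced with a short Python generator. This is an illustrative sketch; the function names `propositions` and `expand` are hypothetical, and the order is assumed to be by increasing length and then alphabetical over the symbols L and R, which matches the six computations listed.

```python
from itertools import count, islice, product


def propositions():
    """Yield abbreviated propositions P, PL, PR, PLL, PLR, PRL, ... without end."""
    for length in count(0):
        for word in product("LR", repeat=length):
            yield "P" + "".join(word)


def expand(abbreviation: str, variable: str = "x") -> str:
    """Expand an abbreviation such as PRL into nested form P(R(L(x)))."""
    opening = "".join(symbol + "(" for symbol in abbreviation)
    return opening + variable + ")" * len(abbreviation)


first_six = list(islice(propositions(), 6))
```

Because the number of propositions of length n grows as 2^n, a schema that must remember which proposition it is currently evaluating needs unbounded storage, which is the intuition used in the lemma that follows.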

Lemma 6.1. There is no schema of $DF_{arb}(1)$ which is equivalent to Example B.

Proof. We assume $S$ is a schema of the class $DF_{arb}(1)$ which is equivalent to
Example B. We can imagine an interpretation $I$ which gives ‘T’ for a proposition
of a certain length and gives ‘F’ for the other propositions. If the length is greater
than the number of arcs included in $S$, $S$ cannot hold all the intermediate values.
This is a contradiction. □

Theorem 6.1. $EF > DF_{arb}(1)$

Proof. If we assume that $EF \equiv DF_{arb}(1)$, then $DF_{arb}(1) \ge EF^d$ is concluded
from $EF \ge EF^d$. This contradicts Lemma 6.1. □

7 Conclusions 

We have revealed the structure of the inclusion relations between some classes of
dataflow schemata. Together with some equivalence relations [4-6], the whole structure
is illustrated in Fig. 3. Here, $\alpha \to \beta$ means that $\alpha$ is properly
included in $\beta$.

The important points which can be concluded from the figure are as follows:

(1) Both timing dependency and recursion are necessary for the class of 
dataflow schemata to become equivalent to the class EF of effective functionals. 

(2) Timing dependency and recursion play different roles
in dataflow schemata. Recursion is strong enough to make $RDF(1)$ stronger
than $DF(1)$, while it cannot make $RDF(\infty)$ stronger than $DF(\infty)$. Timing
dependency makes $DF_{arb}(1)$ stronger than $DF(1)$.

(3) The value of $p$ plays a rather minor role. It has an effect only when
recursion is not included.



[Figure 3: inclusion diagram of the classes. Recoverable label:
$ADF(1) \equiv ADF(\infty) \equiv EF$.]

Fig. 3. Inclusion relations between classes.

We have investigated the expression abilities of dataflow languages by the 
classes of functionals which can be realized. It seems that the conclusions are 
very suggestive for the language design of any kind of parallel processing. 



References 

1. Constable, R. L. and Gries, D., “On Classes of Program Schemata”, SIAM Journal
on Computing, 1, 1, pp. 66-118 (1972).

2. Strong, H. R., “High Level Languages of Maximum Power”, Proc. IEEE Conf.
on Switching and Automata Theory, pp. 1-4 (1971).

3. Jaffe, J. M., “The Equivalence of r.e. Program Schemes and Data flow Schemes”,
JCSS, 21, pp. 92-109 (1980).

4. Matsubara, Y. and Noguchi, S., “Expressive Power of Dataflow Schemes on Partial
Interpretations”, Trans. of IECE Japan, J67-D, 4, pp. 496-503 (1984).

5. Matsubara, Y. and Noguchi, S., “Dataflow Schemata of Maximum Expressive
Power”, Trans. of IECE Japan, J67-D, 12, pp. 1411-1418 (1984).

6. Matsubara, Y., “Necessity of Timing Dependency in Parallel Programming
Language”, HPC-ASIA 2000, The Fourth International Conference on High Performance
Computing in Asia-Pacific Region, May 14-17, 2000, Beijing, China.

7. Matsubara, Y., “$ADF(\infty)$ is Also Equivalent to EF”, The Bulletin of The Faculty
of Information and Communication, Bunkyo University (1995).

8. Matsubara, Y. and Miyagawa, H., “Ability of Classes of Dataflow Schemata with
Timing Dependency”, http://www.bunkyo.ac.jp/ matubara/timing.ps



A New Model of Parallel Distributed Genetic
Algorithms for Cluster Systems:
Dual Individual DGAs

Tomoyuki Hiroyasu, Mitsunori Miki, Masahiro Hamasaki, and Yusuke Tanimura

Department of Knowledge Engineering and Computer Sciences, Doshisha University
1-3 Tatara Miyakodani Kyotanabe Kyoto 610-0321, Japan
Phone: +81-774-65-6638. Fax: +81-774-65-6780
tomo@is.doshisha.ac.jp



Abstract. A new model of parallel distributed genetic algorithm, the Dual
Individual Distributed Genetic Algorithm (DuDGA), is proposed. This
algorithm frees the user from having to set some parameters because 
each island of Distributed Genetic Algorithm (DGA) has only two indi- 
viduals. DuDGA can automatically determine crossover rate, migration 
rate, and island number. Moreover, compared to simple GA and DGA 
methods, DuDGA can find better solutions with fewer analyses. Capa- 
bility and effectiveness of the DuDGA method are discussed using four 
typical numerical test functions. 



1 Introduction 

The genetic algorithm (GA) (Goldberg, 1987) is an optimization method based
on some of the mechanisms of natural evolution. The Distributed Genetic
Algorithm (DGA) is one model of parallel genetic algorithm (Tanese, 1989; Belding,
1995; Miki et al., 1999). In the DGA, the total population is divided into
sub-populations (islands), and genetic operations are performed for several iterations
on each sub-population. After these iterations, some individuals are chosen for
migration to another island. This model is useful in parallel processing as well as in
sequential processing systems. The reduced number of migrations reduces data
transfer, so this model lends itself to use on cluster parallel computer systems.
Moreover, the DGA can derive good solutions with lower calculation cost than
the single population model (Gordon and Whitley, 1993; Whitley et al., 1997).

Genetic Algorithms (GA) require user-specified parameters such as crossover 
and mutation rates. DGA users must determine additional parameters includ- 
ing island number, migration rate, and migration interval. Although Cantu-Paz 
(1999) investigated DGA topologies, migration rates, and populations, the prob- 
lems relating to parameter setting remain. 

The optimal number of islands was investigated for several problems, and it
was found that a model with a larger number of islands derives better solutions
when the total population size is fixed. Using this result, a new algorithm called
the Dual Individual Distributed Genetic Algorithm (DuDGA) is proposed. In DuDGA,
there are only two individuals on each island. As a result, the crossover rate,
migration rate, and island number are determined automatically, and the optimum
solution can be found rapidly. The capability and effectiveness of DuDGA, its
automatic parameter setting, and its lower calculation cost are discussed using four
typical numerical test functions. The results are derived using a sequential
processing system.

2 Dual Individual Distributed Genetic Algorithms 

Distributed Genetic Algorithms (DGAs) are powerful algorithms that can derive
better solutions with lower computation costs than Canonical GAs (CGAs).
Therefore, many researchers have studied DGAs (Nang et al., 1994; Whitley
et al., 1997; Munemoto et al., 1993; Gordon and Whitley, 1993). However,
DGAs have the disadvantage that they require careful selection of several
parameters, such as the migration rate and migration interval, that affect the quality
of the solutions.

In this paper, we propose a new model of Distributed Genetic Algorithm,
called the "Dual Individual Distributed Genetic Algorithm" (DuDGA).
DuDGAs have only two individuals on each island.
The concept is shown in Figure 1.



[Figure 1: islands with sub-population size = 2, crossover rate = 1.0 and
migration rate = 0.5.]

Fig. 1. Dual Individual Distributed Genetic Algorithms



In the proposed DuDGA model, the following operations are performed.

- The population of each island is determined (two individuals). 

- Selection: only individuals with the best fit in the present and in one previous 
generation are kept. 

- Migration method: the individual who will migrate is chosen at random. 

- Migration topology: the stepping stone method where the migration desti- 
nation is determined randomly at every migration event. 
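The island-level operations listed above can be sketched in Python for a single generation on one island. This is a minimal sketch under assumptions that the paper does not state: one-point crossover, bit-flip mutation with rate 1/L, a fitness that is maximized, and a selection rule read as "keep the best parent and the best child". All function names are hypothetical.

```python
import random


def one_point_crossover(a, b, rng):
    """Exchange the tails of two equal-length bit lists at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(bits, rng):
    """Flip each bit with probability 1/L, where L is the chromosome length."""
    rate = 1.0 / len(bits)
    return [1 - gene if rng.random() < rate else gene for gene in bits]


def island_generation(island, fitness, rng):
    """Advance a two-individual island by one generation.

    Crossover is always applied (rate 1.0). The best individual of the previous
    generation and the best of the present generation survive.
    """
    parent_a, parent_b = island
    child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
    children = [mutate(child_a, rng), mutate(child_b, rng)]
    best_parent = max(island, key=fitness)
    best_child = max(children, key=fitness)
    return [best_parent, best_child]


rng = random.Random(0)
island = [[0] * 8, [1, 1, 1, 1, 0, 0, 0, 0]]
new_island = island_generation(island, sum, rng)  # sum counts the 1 bits
```

Since the best parent always survives under this reading, the best fitness on an island never decreases from one generation to the next.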

One of the advantages of the DuDGA is that users are freed from setting
some of the parameters. By limiting the population to two individuals on each
island, the DuDGA model enables the following parameters to be determined
automatically:

- crossover rate: 1.0

- number of islands: total population size/2

- migration rate: 0.5
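The automatic parameter setting can be written down in a few lines of Python. This is an illustrative sketch; the function name `dudga_parameters` and the dictionary keys are hypothetical, while the values follow the three rules listed above.

```python
def dudga_parameters(total_population: int) -> dict:
    """Derive the DuDGA parameters that follow from two individuals per island."""
    if total_population < 2 or total_population % 2 != 0:
        raise ValueError("total population must be an even number of at least 2")
    return {
        "island_size": 2,
        "crossover_rate": 1.0,  # both individuals of an island always mate
        "number_of_islands": total_population // 2,
        "migration_rate": 0.5,  # one of the two individuals migrates
    }


params = dudga_parameters(240)
```

With a total population of 240, this gives 120 islands.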

However, because each island has only two individuals, several questions arise. 
Does the DuDGA model experience a premature convergence problem? Even 
when the DuDGA can find a solution, does the solution depend on the operation 
of mutation? The numerical examples clarify answers to these questions. The 
examples also demonstrate that the DuDGA model can provide higher reliability 
and achieve improved parallel efficiencies at a lower computation expense than 
the DGA model. 



3 Parallel Implementation of DuDGA 

The schematic of the parallel implementation of the DuDGA model is presented
in Figure 2. The process is performed as follows:

1. The islands are divided into sub groups. Each group is assigned to one pro- 
cessor. 

2. DuDGA is performed for each group. During this step, migration occurs 
within the group. 

3. After some iterations, one of the islands in each group is chosen and is moved 
to the other group. 

4. Step 2 is repeated for the newly formed groups. 
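Steps 1 and 3 of this procedure can be sketched in Python as follows. The sketch makes assumptions beyond the text: islands are dealt to groups round-robin, the island to move is chosen at random, and the destination is the next group in a ring. The function names are hypothetical.

```python
import random


def split_into_groups(islands, n_groups):
    """Step 1: divide the islands into n_groups groups, one group per processor."""
    return [islands[g::n_groups] for g in range(n_groups)]


def exchange_islands(groups, rng):
    """Step 3: each group gives one randomly chosen island to the next group.

    The ring-shaped destination is an assumption of this sketch. Groups are
    modified in place and keep their sizes.
    """
    outgoing = [group.pop(rng.randrange(len(group))) for group in groups]
    for index, island in enumerate(outgoing):
        groups[(index + 1) % len(groups)].append(island)


islands = list(range(120))  # island identifiers only, for illustration
groups = split_into_groups(islands, 4)
exchange_islands(groups, random.Random(0))
```

Only one island per group crosses a processor boundary in each exchange, which is why the scheme keeps network traffic low.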

Limiting the migration of islands between groups keeps network traffic (data
transfer) to a minimum. The schematic in Figure 2 corresponds to a DuDGA
implemented on two parallel processors.



[Figure 2: two processes (process 1 and process 2) alternating between a GA
operator phase and a migration phase.]

Fig. 2. Parallel Implementation of DuDGA



4 Numerical Examples 

This section discusses numerical examples used to demonstrate the DuDGA
model. The effects of the number of islands and population size on the
performance of DuDGA are presented. The reliability, convergence, and parallel
efficiency of the algorithm are discussed.

4.1 Test Functions and Used Parameters

Four types of numerical test functions (Equations 1-4) are considered.

[Equations 1-4: definitions of the test functions F1, F2, F3 and F4.]

The number of design variables (ND), the number of bits (NB) and the
characteristics of the test functions are summarized in Table 1.



Table 1. Test Functions

Function   Name         ND   NB
F1         Rastrigin    20   200
F2         Rosenbrock    5    50
F3         Griewank     10   100
F4         Ridge        10   100



It is easy for GAs to derive solutions for the Rastrigin function (F1) because
it is separable: each design variable contributes to the function value
independently of the others. Conversely, it is difficult for GAs to find solutions
for functions whose design variables depend on one another, such as the Rosenbrock
(F2) and Ridge (F4) functions. The degree of difficulty in finding solutions for the
Griewank function (F3) lies between that for F1 and F2. Table 2
summarizes the parameters specified for the DGA and DuDGA operators.
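For orientation, the four test functions can be written in Python in their commonly used textbook forms. These forms are an assumption of this sketch: the paper's own Equations 1-4 may differ in details such as the variable ranges or the exact Rosenbrock variant.

```python
import math


def rastrigin(x):
    """Separable, multimodal; minimum 0 at x = (0, ..., 0)."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)


def rosenbrock(x):
    """Chained form with coupled neighbours; minimum 0 at x = (1, ..., 1)."""
    return sum(
        100 * (x[i + 1] - x[i] ** 2) ** 2 + (1 - x[i]) ** 2
        for i in range(len(x) - 1)
    )


def griewank(x):
    """Variables coupled through a product term; minimum 0 at the origin."""
    total = sum(xi * xi for xi in x) / 4000
    product = math.prod(math.cos(xi / math.sqrt(i)) for i, xi in enumerate(x, start=1))
    return 1 + total - product


def ridge(x):
    """Sum of squared partial sums; minimum 0 at the origin."""
    return sum(sum(x[:i]) ** 2 for i in range(1, len(x) + 1))
```

In these forms the Rastrigin function is a sum of one-variable terms, whereas the Rosenbrock and Ridge functions mix several variables inside each term, which is the kind of coupling that makes them harder for a GA with bitwise crossover.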

The algorithm is terminated when the number of generations exceeds
5,000. Results shown are the average of 20 trials. The DGA needs several
parameters which users must set. However, since the DuDGA has only two individuals
on each island, all parameters except the population size and the migration interval
are determined automatically.

Table 2. Used Parameters

Parameter            DGA            DuDGA
Crossover rate       1.0            1.0
Population size      240            240
Mutation rate        1/L            1/L
Number of islands    4, 8, 12, 24   120
Migration rate       0.3            0.5
Migration interval   5              5

L: chromosome length



4.2 Cluster System 

In this paper, the simulations are performed on a parallel cluster constructed
from 16 Pentium II (400 MHz) personal computers (PCs) connected by Fast
Ethernet. This cluster is similar to a Beowulf cluster and uses an ordinary
commodity network. Therefore, an increase in network traffic decreases the
parallel efficiency.



4.3 Effects of the Number of Islands 

The effect of the number of islands on the reliability and convergence of the DGA
is discussed in this section. Reliability is the fraction of the 20 trials
in which an optimum was found. The reliability of the DGA for the four test functions
and varying numbers of islands is shown in Figure 3.



[Figure 3: bar chart of reliability for F1 to F4; legend: 4, 8, 12 and 24 islands.]

Fig. 3. Reliability of DGA



Figure 3 shows that the reliability of the DGA increases with the number of
islands for test functions F1, F3, and F4. F2 is a problem for which GAs are not
good at finding solutions. Therefore, DGAs cannot find good results for F2.

Figure 4 shows the number of evaluations needed to locate the optimum
solution. A substantial portion of the computational effort is spent in evaluating
fitness functions, and hence a smaller number of function evaluations is
desirable.




[Figure 4: bar chart of the number of evaluations for F1 to F4; legend: 4, 8, 12
and 24 islands.]

Fig. 4. Number of Evaluations



The results presented in Figure 4 indicate that the DGA requires the fewest
function evaluations with the largest number of islands. Hence, it can
be concluded that the DGA should have as many islands as possible. DuDGA
exploits this characteristic by maximizing the number of islands and minimizing
the number of individuals per island.



4.4 Evaluation of DuDGA Performance 

Reliability and Convergence. Figures 5 and 6 show the reliability and the
number of function evaluations for convergence of DuDGA.



[Figure 5: bar chart of reliability for F1 to F4; legend: 4, 8, 12 and 24 islands
(DGA) and DuDGA.]

Fig. 5. Comparison of Reliability of DGA and DuDGA



[Figure 6: bar chart of the number of function evaluations; legend: 4, 8, 12 and
24 islands (DGA) and DuDGA.]

Fig. 6. Comparison of Number of Function Evaluations Required by DGA and DuDGA

Figures 5 and 6 show that DuDGA exhibited higher reliability and faster
convergence than the DGA for all four test functions. Figures 7 and
8 show the iteration histories of the objective function and Hamming distance
values. The Hamming distance is a measure of the difference between two strings,
in this case the binary-coded chromosomes.
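As a small illustration, the Hamming distance between two binary-coded chromosomes can be computed as follows. The function name `hamming_distance` is hypothetical; the definition (the number of positions at which two equal-length strings differ) is the standard one.

```python
def hamming_distance(a, b) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("chromosomes must have the same length")
    return sum(gene_a != gene_b for gene_a, gene_b in zip(a, b))


distance = hamming_distance("10110", "10011")
```

A distance of zero to the optimum means that the chromosome encodes the optimum exactly, and a large average distance within a population indicates that diversity is still present.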




[Figure 7: objective function value versus generation.]

Fig. 7. Iteration History of the Objective Function Value



In Figure 7, the evaluation values of DuDGA are not good
in the first generations of the process, but in the later generations they are
better than those of the other DGAs. The same can be said of Figure 8.
The diversity of the solutions can be judged from the Hamming distances: when
the Hamming distance is large, the GA still has diversity in its solutions. In the
first generations of the process, there is high diversity in DuDGA. In the
later generations, on the other hand, the solutions converge to a point and
DuDGA loses its diversity quickly. Compared to the other DGAs, the convergence of
DuDGA is slower during the first generations of the process. This is because
DuDGA is searching for global rather than local solutions. Later, when DuDGA
is searching for local solutions, the values converge quickly, and the model finds
better solutions than the other DGAs.

[Figure 8: Hamming distance to the optimum solution versus generation.]

Fig. 8. Iteration History of the Hamming Distance to the Optimum Solution



Parallel Efficiency. The calculation speedup of DuDGA for the Rastrigin function
(F1) using multiple parallel processors is shown in Figure 9. The results are for a
fixed population size and number of islands but with a varying number of groups.




[Figure 9: speedup versus the number of processors.]

Fig. 9. Speed Up (one process per one processor)



The speedup of the DuDGA implemented on multiple processors is more
than linear. The speedup is much higher initially (1-10 processors) and levels
off later (14-16 processors). There are two significant reasons for this high
parallel efficiency. First, DuDGA limits information flow between processes (groups),
thereby reducing data transfer and network congestion. Second, the
total number of calculations that must be performed is reduced by distributing
the computation processes.
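The terms used here can be made concrete with the usual definitions of speedup and parallel efficiency. The sketch below assumes those standard definitions, and the timings in it are hypothetical values chosen for illustration, not measurements from the paper.

```python
def speedup(time_one_processor: float, time_p_processors: float) -> float:
    """Speedup S(p) = T(1) / T(p)."""
    return time_one_processor / time_p_processors


def efficiency(time_one_processor: float, time_p_processors: float, p: int) -> float:
    """Parallel efficiency E(p) = S(p) / p; a value above 1 means superlinear."""
    return speedup(time_one_processor, time_p_processors) / p


# Hypothetical timings in seconds, for illustration only.
s16 = speedup(2000.0, 100.0)
e16 = efficiency(2000.0, 100.0, 16)
```

With these illustrative numbers the speedup on 16 processors is 20 and the efficiency is 1.25, which is what "more than linear" means: the parallel run performs less total work than the sequential one, for example because fewer function evaluations are needed to reach the optimum.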

In Figure 10, the computation times on the PC cluster system are shown. These
results are obtained for the 16-group DuDGA problem. When there are two
processors, each processor has 8 groups. When there are 8 processors, each processor
has 2 groups. As Figure 10 shows, when the number of processors increases,
calculation time decreases. When a PC cluster with fewer processors than groups
is used, each processor must perform several processes. This is not an efficient
option when a Linux-based PC cluster system is used. In order to maximize
efficiency, the process threads should be parallelized or the user should use the
same number of processors as groups.

[Figure 10: computation time (vertical axis, 0 to 2500) versus number of
processors (1, 2, 4, 8, 16).]

Fig. 10. Computation time (16 processes)

5 Conclusions 

This paper presents a new model of parallel distributed genetic algorithm called
the "Dual Individual Distributed Genetic Algorithm." The DuDGA model was
applied to four typical test functions (Rastrigin, Rosenbrock, Griewank, and
Ridge) to find optimum solutions on PC cluster systems. DuDGA's use of only
two individuals on each island enables it to determine some GA parameters
automatically. This reduces the time required to implement analyses, reduces
problems associated with inadequate parameter selection, and decreases processing
time. The evaluation of the method with the test cases leads to the
following conclusions:

- When the total population size is fixed, the more islands there are, the faster
the convergence. The DuDGA exploits this characteristic.

- Compared to the DGA where the number of islands is relatively small,
DuDGA can derive better solutions with a smaller number of function
evaluations.

- The DuDGA searches using a crossover operation; it cannot search effectively
while using only the mutation operation.

- DuDGA performs global searches during the first generations. In the latter
part of the analysis, convergence proceeds rapidly, and a local search is
performed.

- When the population size is small, a standard GA cannot find an optimum
solution due to premature convergence. When the population size is large,
the GA can derive an optimum solution, but computation time is wasted.
DuDGA does not waste much computational effort, even when the population
size is large.

- Because of its high efficiency resulting from the reduced data transfer
between groups, DuDGA is an effective method for performing Genetic
Algorithms on distributed parallel cluster systems.

Abstract. OpenMP is a programming API for shared memory computers. It
is supported by most of the major vendors of shared memory computers and
has, in the few years since it came out, become one of the major industry
standards for parallel programming. In this paper, we introduce the latest
version of OpenMP: version 2.0.



1 Introduction 

OpenMP is an Application Programming Interface (API) for writing multi-threaded 
applications for shared-memory computers. OpenMP lets programmers write code 
that is portable across most shared memory computers. 

The group that created OpenMP - the OpenMP architecture review board (or 
ARB) — included the major players in the world of shared memory computing: DEC 
(later Compaq), HP, IBM, Intel, SGI, SUN, KAI, and to provide a user point of view, 
scientists from the U.S. Department of Energy’s ASCI program. We started our work 
in 1996 and completed our first specification in late 1997 (Fortran OpenMP 
version 1.0). This was followed by OpenMP 1.0 for C and C++ in late 1998 and a 
minor upgrade to the Fortran specification in late 1999 (version 1.1). 

OpenMP 1.1 is well suited to the needs of the Fortran 77 programmer.
Programmers wanting to use modern Fortran language features, however, were
frustrated with OpenMP 1.1. For example, threadprivate data could only be defined
for common blocks, not module variables. OpenMP 2.0 was created to address the
needs of these programmers. We started work on OpenMP 2.0 in the spring of 1999.
It should be complete by late 2000.

In this paper, we provide a brief introduction to OpenMP 2.0. We begin by
describing the OpenMP ARB and the goals we use for our work. Understanding these
goals will make it easier for readers to understand the compromises we made as we
created this new specification. Next, we provide a quick introduction to the major
new features in OpenMP 2.0.

Throughout this paper, we assume the reader is familiar with earlier OpenMP
specifications. If this is not the case, the reader should go to the OpenMP web site
(www.openmp.org) and download the specifications.

2 The OpenMP ARB

OpenMP is an evolving standard. Rather than create the specifications and then go 
our separate ways, the creators of OpenMP formed an ongoing organization to 
maintain OpenMP, provide interpretations to resolve inevitable ambiguities, and 
produce new specifications as needed. This organization is called the OpenMP 
Architecture Review Board, or the ARB for short. The official source of information
from the ARB to the OpenMP public is our web site: www.openmp.org.

Companies and organizations, not individuals, join the ARB. Members fall into two
categories. Permanent members have a long term business interest in OpenMP. 
Typically, permanent members market OpenMP compilers or shared memory 
computers. Auxiliary members serve for one year terms and have a short term or 
non-business interest in OpenMP. For example, a computing center servicing several 
teams of users working with OpenMP might join the ARB so their users could have a 
voice in OpenMP deliberations. Regardless of the type of membership, each member 
has an equal voice in our OpenMP discussions. 

To understand the design of OpenMP and how it is evolving, consider the ARB’s 
goals: 

• To produce API specifications that let programmers write portable, efficient, 
and well understood parallel programs for shared memory systems. 

• To produce specifications that can be readily implemented in robust 
commercial products, i.e. we want to standardize common or well 
understood practices, not chart new research agendas. 

• To whatever extent makes sense, deliver consistency between programming 
languages. The specification should map cleanly and predictably between C, 
Fortran, and C++. 

• To keep OpenMP just large enough to express important control-parallel,
shared memory programs, but no larger. OpenMP needs to stay
"lean and mean".

• Legal programs under an older version of an OpenMP specification should 
continue to be legal under newer specifications. 

• To whatever extent possible, we will produce specifications that are 
sequentially consistent. If sequential consistency is violated, there should be 
documented reasons for doing so. 

We use this set of goals to keep us focused on that delicate balance between 
innovation and pragmatic technologies that can be readily implemented. 



3 The Contents of OpenMP 2.0



OpenMP 2.0 is a major upgrade of OpenMP for the Fortran language. In addition to 
the ARB goals discussed earlier, the OpenMP 2.0 committee had three additional 
goals for the project: 

• To support the Fortran95 language and the programming practices Fortran95
programmers typically use.

• To extend the range of applications that can be parallelized with OpenMP.

• To avoid features that would prevent the implementation of
OpenMP-conformant Fortran77 compilers.

At the time this paper is being written, the specification is in draft form and open to 
public review. Any interested reader can download the document from the OpenMP 
web site (www.openmp.org). Here are some key statistics about the OpenMP 2.0 
specification: 

• The document is 115 pages long.

• 56 pages for the specification itself.

• 53 pages of explanatory appendices, including 28 pages of examples.

Note that the specification itself is not very long. It can easily be read and 
understood in a single sitting. Another nice feature of the specification is the 
extensive set of examples we include in the appendices. Even with a simple API such 
as OpenMP, some of the more subtle language features can be difficult to master. We 
have tried to include examples to expose and clarify these subtleties. 

The changes made to the specification in moving from OpenMP 1.1 to OpenMP 2.0 
fall into three categories: 

• Cleanup: Fix errors and oversights in the 1.1 spec and address consistency 
issues with the C/C++ spec. 

• Fortran90/95 support. 

• New functionality: New functionality to make OpenMP easier to use and 
applicable to a wider range of problems. 

In the next three sections, we will briefly outline the OpenMP 2.0 features in each 
category. We will not define each change in detail, and refer the reader interested in 
such details to the draft specification itself (available at www.openmp.org). 



4 OpenMP 2.0 Features: Cleanup 

We made some mistakes when we created the Fortran OpenMP specifications. In 
some cases, we didn’t anticipate some of the ways people would use OpenMP. In 
other cases, we forgot to define how certain constructs would interact with other parts 
of the Fortran language. In a few cases, we just plain “got it wrong”. 

In OpenMP 2.0, we fixed many of these problems. The main changes we made 
under this category are: 

• Relax reprivatization rules 

• Allow arrays in reduction clause 

• Require that a thread cannot access another thread’s private variables. 

• Add nested locks to the Fortran runtime library 

• Better define the interaction of private clauses with common blocks and 
equivalenced variables 

• Allow comments on same line as directive.

• Define how a STOP statement works inside a parallel region. 

• Save implied for initialized data in f77. 

In figure 1, we show an example of some of these “cleanup-changes”. Notice in 
line 6 of the program, we have an IF statement that has a STOP statement in its body. 
Most OpenMP implementations support the use of STOP statements, but the 
specification never stated how this statement worked in the context of OpenMP. In 
OpenMP 2.0, we have defined that a STOP statement halts execution of all threads in 
the program. We require that all memory updates occurring at the barrier (either
explicit or implicit) prior to the STOP have completed, and that no memory updates
occurring after the subsequent barrier have occurred.

A more important change is seen in the reduction clause. In OpenMP 1.1, we only
allowed scalar reduction variables. This was a serious oversight on our part. It doesn’t
matter to the compiler implementer if the reduction variable is a member of an array 
or a scalar variable. Given that many applications that use reductions need to reduce 
into the elements of an array, we changed the specification to allow arrays as 
reduction variables. 



      real force(NMAX,NMAX)
      logical FLAG
      force(1:NMAX,1:NMAX) = 0.0
C$OMP parallel private(fij, FLAG)
      call big_setup_calc(FLAG)
      if (FLAG) STOP
C$OMP do private(fij, i, j) reduction(+:force)
      do i = 1, N
         do j = low(i), hi(i)
            fij = potential(i,j)
            force(i,j) = force(i,j) + fij
            force(j,i) = force(j,i) - fij
         end do
      end do
C$OMP end parallel



Fig. 1. This program fragment shows several different features that we cleaned
up in OpenMP 2.0. First, we defined the semantics of a STOP statement inside an
OpenMP parallel region. Second, it is now possible to privatize a variable that is
already private (fij in this program fragment). Finally, it is now possible to use
an array in a reduction clause.
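
One way to picture the array reduction of Fig. 1 is that each thread sums into its own zeroed copy of the array and the copies are combined at the end of the loop. The C sketch below emulates that by hand. The function name `reduce_forces`, the block partitioning, and the sequential emulation of threads are illustrative assumptions, not something the specification prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hand-written emulation of reduction(+:force) over an n-element array.
 * Each of nthreads emulated threads accumulates into a private, zeroed copy;
 * the copies are then summed into force[]. Hypothetical helper, for
 * illustration only. Returns 0 on success, -1 if allocation fails. */
int reduce_forces(double *force, size_t n,
                  const double *contrib, const size_t *target,
                  size_t ncontrib, int nthreads)
{
    assert(force != NULL && nthreads > 0);
    double *priv = calloc((size_t)nthreads * n, sizeof *priv);
    if (priv == NULL) return -1;

    for (int t = 0; t < nthreads; ++t) {      /* one pass per emulated thread */
        size_t lo = ncontrib * (size_t)t / (size_t)nthreads;
        size_t hi = ncontrib * (size_t)(t + 1) / (size_t)nthreads;
        for (size_t k = lo; k < hi; ++k)
            priv[(size_t)t * n + target[k]] += contrib[k];
    }
    for (int t = 0; t < nthreads; ++t)        /* combine the private copies */
        for (size_t i = 0; i < n; ++i)
            force[i] += priv[(size_t)t * n + i];

    free(priv);
    return 0;
}
```

The private copies cost `nthreads * n` elements of extra storage, which is the price of letting several threads update the same array element without synchronization inside the loop.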



In OpenMP 1.1, we did not allow programmers to privatize a variable that had
already been privatized. We thought a programmer would never do this
intentionally, so the re-privatization restriction would help ensure correct
programs. In practice, this restriction has only irritated
OpenMP programmers. Since there is no underlying reason compelling us to keep this
restriction in the language, we decided to drop it in OpenMP 2.0. You can see this in
figure 1, where we privatized the variable fij on the PARALLEL construct and again
on the enclosed DO construct.

5 OpenMP 2.0 Features: Addressing the Needs of Fortran95

The primary motivation in creating OpenMP 2.0 was to better meet the needs of 
Fortran95 programmers. We surveyed Fortran95 programmers familiar with OpenMP 
and found that most of them didn’t need much beyond what was already in 
OpenMP 1.1. We met their needs by adding the following features to OpenMP: 

• Share work of executing array expressions among a team of threads. 

• Extend Threadprivate so you can make variables threadprivate, not just 
common blocks. 

• Interface declaration module with integer kind parameters for lock variables. 

• Generalize definition of reduction and atomic so renamed intrinsic functions 
are supported. 

We show an example of work sharing array expressions in figure 2. This program 
fragment shows how a simple WORKSHARE construct can be used to split up the 
loops implied by array expressions between a team of threads. Each array expression 
is WORKSHARE’ed separately with a barrier implied at the end of each one. Notice
that the WORKSHARE directive doesn’t indicate how this sharing should take place.
We felt that if such detailed control was needed, a programmer had the option to
expand the loops by hand and then use the standard “OMP DO” construct.
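
As a rough illustration of the hand-expanded alternative, the following C sketch writes the two array statements of the example as an explicit loop and applies the C counterpart of the DO construct. The function name `mul_sub` and the flat one-dimensional layout are assumptions made for the sketch. The two statements are fused into one loop here because they are independent of each other.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-expanded form of the array statements a = b * c and d = b - c.
 * With an OpenMP compiler the pragma divides the iterations among the
 * threads of a team; without one the pragma is ignored and the loop runs
 * serially with the same result. */
void mul_sub(size_t n, const double *b, const double *c, double *a, double *d)
{
    assert(a != NULL && b != NULL && c != NULL && d != NULL);
    long i;   /* signed loop index, accepted by all OpenMP C versions */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < (long)n; ++i) {
        a[i] = b[i] * c[i];
        d[i] = b[i] - c[i];
    }
}
```

The `schedule(static)` clause is the kind of detailed control that the WORKSHARE form deliberately leaves to the implementation.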



      real, dimension(n,m,p) :: a, b, c, d

!$omp parallel
!$omp workshare
      a = b * c
      d = b - c
!$omp end workshare
!$omp end parallel



Fig. 2. A program fragment showing how to share the work from array statements
among the threads in a team





In Figure 3, we show how the threadprivate construct can be used with variables (as 
opposed to named common blocks). This is an important addition to OpenMP 2.0 
since Fortran95 programmers are encouraged to use module data as opposed to 
common blocks. With the current form of the threadprivate construct, this is now 
possible within OpenMP. 

      PROGRAM P
      REAL A(100), B(200)
      INTEGER UK
!$OMP THREADPRIVATE(A, B, UK)

!$omp parallel copyin(A)   ! Here's an inline comment
!$omp end parallel



Fig. 3. This program fragment shows the use of the threadprivate construct with 
variables as opposed to common blocks. Notice the program also shows yet another 
new OpenMP feature: an in-line comment 
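
For comparison, the C binding of OpenMP already lets a plain file-scope variable be made threadprivate. The sketch below is a minimal illustration under that assumption. The variable `scale` and the function `scale_after_region` are hypothetical names introduced only for this example.

```c
#include <assert.h>

/* A file-scope variable with one copy per thread. Without an OpenMP
 * compiler the pragmas are ignored and there is a single copy. */
static int scale = 1;
#pragma omp threadprivate(scale)

/* Sets the master thread's copy, then lets every thread in the team start
 * from that value (copyin) and update only its own copy. Returns the
 * master's copy after the region. */
int scale_after_region(int initial)
{
    assert(initial >= 0);
    scale = initial;
    #pragma omp parallel copyin(scale)
    {
        scale += 1;   /* each thread touches its own copy only */
    }
    return scale;
}
```

Because every thread increments a private copy, the value returned on the master thread is `initial + 1` regardless of the team size.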

6 OpenMP 2.0 Features: New Functionality



A continuing interest of the ARB is to make OpenMP applicable to a wider range 
of parallel algorithms. At the same time, we want to make OpenMP more convenient 
for the programmer. To this end, we added the following new functionality to 
OpenMP: 

• OpenMP routines for MP code timing 

• The NUM_THREADS() clause for nested parallelism. 

• COPYPRIVATE on single constructs. 

• A mechanism for a program to query OpenMP Spec version number. 

In figure 4, we show several of these new features in action. First, note the use of
the runtime library routine OMP_GET_WTIME(). The timing routines were closely modeled after
those from MPI and they return the time in seconds from some fixed point in the past.
This fixed point is not defined, but it is guaranteed not to change as the program 
executes. Hence, these routines can be used to portably find the elapsed wall-clock 
time used in a program. 
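
The usual pattern is to read the timer twice and use only the difference. The C sketch below shows that pattern with `omp_get_wtime()`, the C counterpart of OMP_GET_WTIME(). The wrapper names `wall_seconds`, `time_call`, and `busy_sum`, and the `time()` fallback for builds without OpenMP, are assumptions added so the sketch is self-contained.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Seconds since some fixed, unspecified point in the past. Uses
 * omp_get_wtime() when compiled with OpenMP; the time() fallback exists
 * only so the sketch builds without it and has one-second resolution. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)time(NULL);
#endif
}

/* Elapsed wall-clock time of work(arg); only the difference between the
 * two readings is meaningful. */
double time_call(void (*work)(void *), void *arg)
{
    assert(work != NULL);
    double start = wall_seconds();
    work(arg);
    return wall_seconds() - start;
}

/* Sample workload for the usage example: sums 0..*n-1. */
void busy_sum(void *arg)
{
    long n = *(long *)arg;
    volatile long total = 0;
    for (long i = 0; i < n; ++i) total += i;
}
```

A call such as `time_call(busy_sum, &n)` returns the elapsed seconds for the workload; the absolute values returned by `wall_seconds` should not be interpreted on their own.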

Another important new feature in figure 4 is the NUM_THREADS clause. This
clause lets the programmer define how many threads to use in a new team. Without
this clause, the number of threads could only be set in a sequential region, preventing
programs with nested parallel regions from changing the number of threads used at
each level in the nesting.

      double precision start, end
      start = OMP_GET_WTIME()
!$OMP PARALLEL NUM_THREADS(8)
!     .... Do a bunch of stuff
!$OMP PARALLEL DO NUM_THREADS(2)
      do I = 1, 1000
         call big_calc(I, results)
      end do
!$OMP END PARALLEL
      end = OMP_GET_WTIME()
      print *, ' seconds = ', end - start



Fig. 4. This program fragment shows how a NUM_THREADS clause can be used to suggest
a different number of threads to be used on each parallel region. It also provides
an example of the OpenMP wallclock timer routines.
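
The nesting pattern of Fig. 4 can be sketched in C with the `num_threads` clause, assuming a compiler that supports the clause in its C binding. The function `count_inner_executions` is a hypothetical helper that counts how many times the innermost body runs. The clause is only a request, so the sketch makes no assumption that the runtime grants it.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Requests `outer` threads for the outer team and `inner` threads for each
 * nested team, then counts how often the innermost body executed. The count
 * equals outer * inner only if the runtime grants every request; with
 * nested parallelism disabled each inner team has a single thread, and
 * without OpenMP the body runs exactly once. */
int count_inner_executions(int outer, int inner)
{
    assert(outer > 0 && inner > 0);
    int count = 0;
    #pragma omp parallel num_threads(outer)
    {
        #pragma omp parallel num_threads(inner)
        {
            #pragma omp atomic
            count += 1;
        }
    }
    return count;
}
```

Whether the inner request is honored depends on the implementation and on whether nested parallelism is enabled, so a portable program should treat the result as bounded by `outer * inner` rather than equal to it.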



Another new feature in OpenMP 2.0 is the COPYPRIVATE clause. This clause can
be applied to an END SINGLE directive to copy the values of private variables from
one thread to the other threads in a team. For example, in figure 5, we show how the
COPYPRIVATE clause can be used to broadcast values input from a file to the
private variables within a team. Without COPYPRIVATE, this could have been
done, but only by using a shared buffer. The COPYPRIVATE clause is much more
convenient and doesn’t require the wasted space implied by the shared buffer.

      REAL A, B, X, Y
      COMMON /XY/ X, Y
!$OMP THREADPRIVATE(/XY/)

!$OMP PARALLEL PRIVATE(A, B)
!$OMP SINGLE
      READ (11) A, B, X, Y
!$OMP END SINGLE COPYPRIVATE(A, B, /XY/)
!$OMP END PARALLEL
      END



Fig. 5. This program fragment shows how to use the copyprivate clause to
broadcast the values of private variables to the corresponding private variables on
other threads.
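
The same broadcast can be sketched in C, where the clause is written on the `single` directive itself rather than on a closing directive. In the sketch below, the function name `broadcast_mismatches` is hypothetical and the function argument stands in for the value read from a file.

```c
#include <assert.h>

/* One thread obtains a value (the argument stands in for file input) and
 * copyprivate broadcasts it into every thread's private copy of `a`.
 * Returns the number of threads whose copy does not match afterwards,
 * so 0 means the broadcast reached the whole team. */
int broadcast_mismatches(double value_from_input)
{
    assert(value_from_input == value_from_input);   /* reject NaN */
    int mismatches = 0;
    double a;
    #pragma omp parallel private(a)
    {
        a = -1.0;                           /* private, arbitrary start */
        #pragma omp single copyprivate(a)
        {
            a = value_from_input;           /* executed by one thread only */
        }
        if (a != value_from_input) {
            #pragma omp atomic
            mismatches += 1;
        }
    }
    return mismatches;
}
```

No shared buffer appears in the code: the copy from the executing thread to the others happens as part of leaving the single construct.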



7 Conclusion 

Computers and the way people use them will continue to change over time. It is
important, therefore, that the programming standards used for computers should
evolve as well.

OpenMP is unique among parallel programming APIs in that it has a dedicated
industry group working to assure it is well tuned to the needs of the parallel
programming community. OpenMP 2.0 is the latest project from this group. The
contents of this specification are being finalized as this paper is being written. We
expect implementations of OpenMP 2.0 will be available late in the year 2000 or early
in 2001.

To learn more about OpenMP 2.0 and the activities of the broader OpenMP
community, the interested reader should consult our web site at www.openmp.org.

Abstract. This paper describes the implementation and evaluation of
the OpenMP compiler designed for the Hitachi SR8000 Super Technical
Server. The compiler performs parallelization for the shared memory
multiprocessors within a node of the SR8000, using the synchronization
mechanism of the hardware to perform high-speed parallel execution. To
create optimized code, the compiler can perform optimizations across
the boundary of a PARALLEL region, or can produce code optimized
for a fixed number of processors according to a compile option.
For the user’s convenience, it supports combining OpenMP with automatic
parallelization or Hitachi proprietary directives, and also supports
reporting diagnostic messages that help the user parallelize programs.

We evaluate our compiler by parallelizing the NPB2.3-serial benchmark with
OpenMP. The results show a 5.3 to 8.0 times speedup on 8 processors.



1 Introduction 

Parallel programming is necessary to exploit the high performance of recent
supercomputers. Among the parallel programming models, the shared memory
parallel programming model is widely accepted because it makes incremental
parallelization of serial programs easy. Until recently, however, to write a
parallel program for shared memory systems, the user had to use vendor-specific
parallelization directives or libraries, which made it difficult to develop portable
parallel programs.

To solve this problem, OpenMP[1][2] was proposed as a common interface for
shared memory parallel programming. The OpenMP Application Programming
Interface (API) is a collection of compiler directives, library routines, and
environment variables that can be used to specify shared memory parallelism in
Fortran and C/C++ programs.

Many computer hardware and software vendors are supporting OpenMP.
Commercial and non-commercial compilers[3][4][5] are available on many
platforms. OpenMP is also being used by Independent Software Vendors for its
portability.



M. Valero et al. (Eds.): ISHPC 2000, LNCS 1940, pp. 391-402, 2000.
© Springer-Verlag Berlin Heidelberg 2000

We implemented an OpenMP compiler for the Hitachi SR8000 Super Technical
Server. The SR8000 is a parallel computer system consisting of many nodes
that incorporate high-performance RISC microprocessors and are connected via a
high-speed inter-node network. Each node consists of several Instruction
Processors (IPs) which share a single address space. Parallel processing within the node
is performed at high speed by the hardware mechanism called Co-Operative
Microprocessors in a single Address Space (COMPAS).

Most implementations, such as NanosCompiler[3] or the Omni OpenMP compiler[4],
use a thread library to implement the fundamental parallel execution model
of OpenMP. As the objective of OpenMP for the SR8000 is to control the
parallelism within the node and exploit the maximum performance of the IPs in the node, we
implemented the fork-join parallel execution model of OpenMP over COMPAS,
in which thread invocation is started by a hardware instruction, so that the
overhead of starting threads and synchronizing between them is reduced
and high efficiency of parallel execution can be achieved.

The other characteristics of our OpenMP compiler are as follows.

- Can be combined with automatic parallelization or Hitachi proprietary
directives

Our OpenMP compiler supports the full OpenMP 1.0 specification. In addition,
our compiler can perform automatic parallelization or parallelization by
Hitachi proprietary directives. Procedures parallelized by OpenMP can be
combined with procedures parallelized by automatic parallelization or by the
Hitachi proprietary directives. This can be used to supplement OpenMP with
Hitachi proprietary directives, or to parallelize a whole program by automatic
parallelization and then use OpenMP to tune an important part of the
program.

- Parallelization support diagnostic messages

Our compiler can detect loop-carried dependences and recognize variables that
should be given a PRIVATE or REDUCTION attribute, and report the
results of these analyses as diagnostic messages. The user can parallelize serial
programs according to these diagnostic messages. The messages also help prevent
incorrect parallelization, because the compiler reports a warning if there is a
possibility that the user’s directive is wrong.

- Optimization across inside and outside of the PARALLEL region

In OpenMP, the code block that should be executed in parallel is explicitly
specified by the PARALLEL directive. Like many implementations, our
compiler extracts the PARALLEL region as a procedure to keep the implementation
simple. Each thread performs parallel processing by executing that
procedure. We designed our compiler so that the extraction of the PARALLEL region
is done after global optimizations are executed. This enables optimizations
across the boundary of the PARALLEL region.

- Further optimized code generation by the -procnum=8 option

The -procnum=8 option is a Hitachi proprietary option intended to bring
out the maximum performance from a node of the SR8000. By default, our
compiler generates an object that can run with any number of threads, but
if the -procnum=8 option is specified, the compiler generates code specifically
optimized for the number of threads fixed at 8. This can exploit the maximum
performance of the 8 IPs in the node.

In this paper, we describe the implementation and evaluation of our OpenMP
compiler for the Hitachi SR8000. The rest of this paper is organized as follows.
Section 2 gives an overview of the architecture of the Hitachi SR8000.
Section 3 describes the structure and features of the OpenMP compiler. Section
4 describes the implementation of the main OpenMP directives. Section 5
describes the results of the performance evaluation of the compiler, along with
some problems in the OpenMP specification found when evaluating
the compiler. Section 6 concludes this paper.

2 Architecture of Hitachi SR8000 

The Hitachi SR8000 system consists of computing units, called “nodes”, each of 
which is equipped with multiple processors. Figure 1 shows an overview of the 
SR8000 system architecture. 




Fig. 1. Architecture of the SR8000 system 



The whole system is a loosely coupled, distributed memory parallel processing
system connected via a high-speed multi-dimensional crossbar network. Each node
has local memory and data are transferred via the network. The remote-DMA
mechanism enables fast data transfer by directly transferring user memory space
to another node without copying to the system buffer.

Each node consists of several Instruction Processors (IPs). The IPs form a
shared memory parallel processing system. The mechanism of Co-Operative
Microprocessors in a single Address Space (COMPAS) enables fast parallel
processing within the node.

The mechanism of COMPAS enables simultaneous and high-speed activation 
of multiple IPs in the node. Under COMPAS, one processor issues the “start 
instruction” and all the other processors start computation simultaneously. The 
“start instruction” is executed by hardware, resulting in high-speed processing. 
The operating system of SR8000 also has a scheduling mechanism to exploit 
maximum performance under COMPAS by binding each thread to a fixed IP. 

As described above, the SR8000 employs a distributed memory parallel system
among nodes and shared memory multiprocessors within a node, which can achieve
high scalability. The user can use a message passing library such as MPI to
control parallelism among nodes, and automatic parallelization or directive-based
parallelization by the compiler to exploit parallelism within the node. OpenMP
is used to control parallelism within the node.



3 Overview of OpenMP Compiler 

OpenMP is implemented as a part of a compiler which generates native codes 
for the SR8000. Figure 2 shows the structure of the compiler. 



Fig. 2. Structure of the OpenMP compiler (inputs: FORTRAN + OpenMP and C + OpenMP)



The OpenMP API is specified for Fortran and C/C++. The front-end
compiler for each language reads the OpenMP directives, converts them to an intermediate
language, and passes it to the common back-end compiler. The front-end compiler
for C is now under development.

The back-end compiler analyzes the OpenMP directives embedded in the
intermediate language and performs parallel transformation. The principal
transformations are categorized as follows.

- Encapsulate the PARALLEL region as a procedure to be executed in parallel,
and generate code that starts the threads to execute the parallel procedure.

- Convert the loop or block of statements so that the execution of the code is
shared among the threads according to work-sharing directives such as
DO or SECTIONS.

- According to the PRIVATE or REDUCTION clauses, allocate each
variable to the thread private area or generate the parallel reduction code.

- Perform the necessary code transformation for the other synchronization
directives.

The transformed code is generated as an object file. The object file is linked 
with the OpenMP runtime library and the executable file is created. 

Our compiler supports the full set of the OpenMP Fortran API 1.0. All of the
directives, libraries, and environment variables specified in the OpenMP Fortran API
1.0 can be used. However, nested parallelism and dynamic thread adjustment
are not implemented.

In addition, our compiler has the following features. 

- Can be combined with automatic parallelization or Hitachi proprietary
directives

Procedures parallelized by OpenMP can be mixed together with procedures
parallelized by automatic parallelization or Hitachi proprietary directives.
With this feature, functions which do not exist in OpenMP can be
supplemented by automatic parallelization or Hitachi proprietary directives. For
example, array reduction is not supported in OpenMP 1.0. In that case, automatic
parallelization or Hitachi proprietary directives can be used to parallelize
the procedure which needs the array reduction, and OpenMP can be used to
parallelize the other procedures.

- Parallelization support diagnostic message

In OpenMP, the compiler basically performs parallelization exactly as
the user’s directives specify. However, when creating a parallel program from a serial
program, it is not necessarily easy for the user to determine if a loop can be
parallelized or if a variable needs privatization.

For this reason, our compiler can perform the same analysis as
automatic parallelization even when OpenMP is used, and report the result of
the analysis as diagnostic messages.



4 Implementation of OpenMP Directives 



In this section, we describe the implementation of the main OpenMP directives. 

4.1 PARALLEL Region Construct

In OpenMP, the code section to be executed in parallel is explicitly specified as
a PARALLEL region. OpenMP uses the fork-join model of parallel execution.
A program begins execution as a single process, called the master thread. When
the master thread reaches the PARALLEL construct, it creates a team of
threads. The code in the PARALLEL region is executed in parallel, with every
thread executing the same code, when explicit work-sharing is not specified. At the end of
the PARALLEL region, the threads synchronize and only the master thread
continues execution.

We implemented the PARALLEL region of OpenMP for the SR8000 using
COMPAS, so that the thread fork-join can be performed at high speed. Figure 3
represents the execution of a PARALLEL region on COMPAS.



Fig. 3. Execution of PARALLEL region on COMPAS (lanes: IP1, IP2)



Under the COMPAS mechanism, each thread is bound to a fixed IP. The 
execution begins at the IP corresponding to the master thread, and the others 
wait for starting. 

When the master thread reaches the PARALLEL construct, the master IP
issues the “start instruction” to the other IPs. This “start instruction” is
executed by hardware, resulting in high-speed processing. The IPs receiving the
“start instruction” also receive the starting address of the PARALLEL region
and immediately begin execution simultaneously.

When the PARALLEL region ends, each IP issues the “end instruction” and
synchronizes. Then only the master IP continues execution and the other
IPs again enter the wait status.

With the COMPAS mechanism, the thread creation of OpenMP is performed
by a hardware instruction sent to already executing IPs, so that the overhead
of thread creation is reduced. Also, as each thread is bound to a fixed IP,
there is no overhead of thread scheduling.

The statements enclosed in the PARALLEL region are extracted as a procedure
during compilation. The compiler generates code in which each thread
runs in parallel by executing this parallel procedure. The PARALLEL region is
extracted as a procedure to simplify the implementation. As the code semantics
inside the PARALLEL region may differ from those outside the PARALLEL
region, normal optimizations for serial programs are not always legal across
the PARALLEL region boundary. Dividing the parallel part and non-parallel
part by extracting the PARALLEL region allows normal optimizations without
being concerned about this problem. The implementation of the storage class
is also simplified by allocating the auto storage of the parallel procedure to the
thread private area. The privatization of a variable in the PARALLEL region
is done by internally declaring it as a local variable of the parallel procedure.
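
The extraction described above can be pictured with a small C sketch. It is an illustration under assumed names (`shared_args`, `parallel_proc`, `run_region`), not the compiler's actual generated code. The "fork" is emulated by calling the procedure once per thread number in sequence, where the real system would start the IPs simultaneously.

```c
#include <assert.h>
#include <stddef.h>

/* Shared data is passed to the extracted procedure by pointer. */
struct shared_args { const double *x; double *y; int n; double scale; };

/* Body of a PARALLEL region extracted as a procedure. `tmp` was PRIVATE in
 * the region, so it becomes an ordinary local variable: one instance per
 * call, that is, per thread. */
static void parallel_proc(int tid, int nthreads, struct shared_args *s)
{
    for (int i = tid; i < s->n; i += nthreads) {   /* cyclic share of work */
        double tmp = s->x[i] * s->scale;
        s->y[i] = tmp + 1.0;
    }
}

/* Stand-in for the fork: real threads would run these calls concurrently;
 * here they run one after another. */
void run_region(int nthreads, struct shared_args *s)
{
    assert(nthreads > 0 && s != NULL);
    for (int tid = 0; tid < nthreads; ++tid)
        parallel_proc(tid, nthreads, s);
}
```

Because `tmp` lives in the procedure's automatic storage, no extra mechanism is needed to give each thread its own copy, which is the simplification the text refers to.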

However, if the PARALLEL region is extracted as a procedure at an early
stage of compilation, necessary optimizations across the boundary
of the PARALLEL region become unavailable. For example, if the
PARALLEL region is extracted as a procedure before constant propagation,
propagation from outside to inside of the PARALLEL region cannot be done
unless interprocedural optimization is performed.

To solve this problem, our compiler first executes basic, global optimizations
and then extracts the PARALLEL region as a parallel procedure. This enables
optimizations across the boundary of the PARALLEL region. As described
above, however, the same optimizations as for a serial program cannot always be applied,
because the same code in the parallel part and non-parallel part may have different
execution semantics. For example, an optimization which moves the definition
of a PRIVATE variable from inside to outside the PARALLEL region should
be prohibited. We avoid this problem by inserting dummy references to the
variables for which such optimization should be prohibited at the entry and exit
points of the PARALLEL region.
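
The effect of the ordering on constant propagation can be sketched in C as two hand-written versions of the same extracted region. The function names and the loop body are hypothetical. The only point is what the optimizer can see inside each procedure.

```c
#include <assert.h>

/* (a) Extracted early: the procedure sees n only through its argument, so
 *     the loop bound is unknown inside it unless interprocedural analysis
 *     is performed. */
static long region_outlined_early(const int *n_shared)
{
    long sum = 0;
    for (int i = 0; i < *n_shared; ++i) sum += i;
    return sum;
}

/* (b) Extracted after global optimization: n = 100 has already been
 *     propagated into the region body, so the bound is a literal. */
static long region_outlined_late(void)
{
    long sum = 0;
    for (int i = 0; i < 100; ++i) sum += i;
    return sum;
}

/* Both versions compute the same result; only the information available
 * to the optimizer inside the procedure differs. Returns 1 if they agree
 * with the expected sum of 0..99. */
int outlining_orders_agree(void)
{
    int n = 100;                 /* constant assigned outside the region */
    long early = region_outlined_early(&n);
    long late = region_outlined_late();
    assert(early == late);
    return early == 4950 && late == 4950;
}
```

In version (b) later passes can, for example, unroll or fully evaluate the loop, which version (a) permits only with interprocedural optimization.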



4.2 DO Directive 

The DO directive of OpenMP parallelizes the loop. The compiler translates the 
codes that set the range of loop index so that the loop iterations are partitioned 
among threads. In addition to parallelizing loops according to the user’s directive, 
our compiler has the following features: 

1. Optimization of loop index range calculation codes 

The code to set index range of parallel loop contains an expression which 
calculates the index range to be executed by each thread using the thread 
number as a parameter. When parallelizing the inner loop of a loop nest, the 
calculation of the loop index range will become an overhead if it occurs just 
before the inner loop. Our compiler performs the optimization that moves 
the calculation codes out of the loop nest if the original range of the inner 
loop index is invariant in the loop nest. 

2. Performance improvement by -procnum=8 option

In OpenMP, the number of threads to execute a PARALLEL region is
determined by an environment variable or a runtime library routine. This means the
number of threads to execute a loop cannot be determined until runtime.
So the calculation of the index range of a parallel loop contains a division
of the loop length by the number of threads. While this feature
increases usability because the user can change the number of threads at
runtime without recompiling, the calculation of the loop index range
becomes an overhead if the number of threads used is always the same constant
value.

Our compiler supports the -procnum=8 option, which aims at exploiting the
maximum performance of the 8 IPs in the node. If the -procnum=8 option is
specified, the number of threads used in the calculation of the loop index
range is assumed to be the constant 8. As a result, performance
improves, especially if the loop length is a constant, because the loop length
after parallelization is evaluated to a constant at compile time and the
division at runtime is removed.

3. Parallelizing support diagnostic message

In OpenMP, the compiler performs parallelization according to the user’s
directive. It is the user’s responsibility to ensure that the parallelized loop
has no dependence across loop iterations and to decide whether privatization is needed
for each variable. However, it is not always easy to determine if a loop can
be parallelized or to examine the need for privatization of all the variables
in the loop. In particular, once the user gives an incorrect directive, it may take
a long time to discover the mistake because of the difficulties of debugging
peculiar to parallel programs.

For this reason, while parallelizing exactly as the user’s directive specifies, our
compiler can report diagnostic messages which
are the result of the compiler’s parallelization analysis. The compiler
inspects the statements in a loop and analyzes whether there is any statement
which prevents parallelization or any variable that has a dependence across loop
iterations. It also recognizes the variables which need privatization or a parallel
reduction operation. Then, if there is any possibility that the user’s directive
is wrong, it generates diagnostic messages. Figure 4 shows an example of
a diagnostic message.

The compiler can also report, for a loop with no OpenMP
directives specified, whether the loop can be parallelized and whether each variable
needs privatization. This is useful when converting a serial program to a parallel
OpenMP program.
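
The index range calculation discussed in items 1 and 2 can be sketched in C as follows. This is an illustrative block partition under assumed names (`chunk_bounds`, `chunk_bounds_fixed8`) and an assumed loop length of 1000. The compiler's actual partitioning formula is not given in the text.

```c
#include <assert.h>

/* Iterations [0, len) assigned to thread `tid` of `nthreads` as one
 * contiguous block. The ceiling division is the run-time cost that the
 * text describes. */
void chunk_bounds(int len, int tid, int nthreads, int *lo, int *hi)
{
    assert(len >= 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    int chunk = (len + nthreads - 1) / nthreads;
    *lo = tid * chunk < len ? tid * chunk : len;
    *hi = *lo + chunk < len ? *lo + chunk : len;
}

/* The same computation with the thread count fixed at 8 and a constant
 * loop length: the chunk size is a compile-time constant, so no division
 * remains at run time. */
enum { FIXED_THREADS = 8, FIXED_LEN = 1000,
       FIXED_CHUNK = (FIXED_LEN + FIXED_THREADS - 1) / FIXED_THREADS };

void chunk_bounds_fixed8(int tid, int *lo, int *hi)
{
    assert(tid >= 0 && tid < FIXED_THREADS);
    *lo = tid * FIXED_CHUNK < FIXED_LEN ? tid * FIXED_CHUNK : FIXED_LEN;
    *hi = *lo + FIXED_CHUNK < FIXED_LEN ? *lo + FIXED_CHUNK : FIXED_LEN;
}
```

When the inner loop of a nest is the parallel one and its original range does not change inside the nest, a call like `chunk_bounds` can be made once before the outer loop instead of on every outer iteration, which corresponds to the optimization in item 1.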



4.3 SINGLE and MASTER Directives 

SINGLE and MASTER directives both specify that the statements enclosed in
the directive are executed only once, by one thread. The difference is that
the SINGLE directive specifies a block of statements to be executed by any
one of the threads in the team, while MASTER specifies a block to be
executed by the master thread. Barrier synchronization is inserted at the
end of the SINGLE directive unless the NOWAIT clause is specified, but no
synchronization is done at the end of the MASTER directive or at the entry
of either directive.

 1: subroutine sub(a,c,n,m)
 2: real a(n,m)
 3: !$OMP PARALLEL DO PRIVATE(tmp1)
 4: do j=2,m
 5: do i=1,n
 6:
 7: tmp1=a(i,jm1)
 8: a(i,j)=tmp1/c
 9: enddo
10: enddo
11: end

(diagnosis for loop structure)
KCHF2015K
  the do loop is parallelized by "omp do" directive. line=4
KCHF2200K
  the parallelized do loop contains data dependencies across loop
  iterations. name=A line=4
KCHF2201K
  the variable or array in do loop is not privatized. name=JM1 line=4

Fig. 4. Example of diagnostic message


The SINGLE directive is implemented so that the first thread which reaches
the construct executes the SINGLE block. This is accomplished by providing
a flag with the SHARED attribute that indicates whether the block has
already been executed; each thread accesses the flag to determine whether
it should execute the block. Because whichever thread arrives first
executes the block, this absorbs differences in execution timing among
threads.

In contrast, the MASTER directive is implemented by adding a conditional
branch so that the block is executed only by the master thread.

The execution of the SINGLE directive is controlled by the SHARED attribute
flag, so the flag must be accessed exclusively, using a lock operation.
Because a lock operation is generally expensive, it becomes an overhead. As
a result, when the execution timings of the threads are almost the same,
using the MASTER (+BARRIER) directive may show better performance than
using the SINGLE directive.
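
The difference between the two implementations can be sketched in C as follows. This is only an illustration under stated assumptions: the type single_state_t and the functions single_init, single_enter and master_enter are hypothetical names, and a C11 atomic_flag spin lock stands in for whatever lock operation the SR8000 run-time actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdatomic.h>

/* Hypothetical per-construct state: a shared "already executed" flag
 * protected by a lock, as the text describes for SINGLE. */
typedef struct {
    atomic_flag lock;   /* simple spin lock (an assumption) */
    int executed;       /* nonzero once some thread took the block */
} single_state_t;

static void single_init(single_state_t *s)
{
    assert(s != NULL);
    atomic_flag_clear(&s->lock);
    s->executed = 0;
}

/* Returns 1 for the first thread that arrives and 0 for all later ones.
 * Every thread pays for the lock, which is the overhead discussed above. */
static int single_enter(single_state_t *s)
{
    int mine;
    assert(s != NULL);
    while (atomic_flag_test_and_set(&s->lock)) {
        /* spin until the lock is free */
    }
    mine = !s->executed;
    s->executed = 1;
    atomic_flag_clear(&s->lock);
    return mine;
}

/* MASTER needs no shared state: a branch on the thread id is enough. */
static int master_enter(int thread_id)
{
    return thread_id == 0;
}
```

In this sketch a thread would run the enclosed block only when single_enter (or master_enter) returns 1; a real construct would also reset the flag before the construct is encountered again, which is omitted here.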







Yasunori Nishitani et al. 



5 Performance 

5.1 Measurement Methodology 

We evaluated our OpenMP compiler with the NAS Parallel Benchmarks (NPB) [6].
The version used is NPB2.3-serial. We parallelized the benchmarks only by
inserting OpenMP directives, without modifying the executable statements,
although some rewriting was done where the program cannot be parallelized
due to limitations of the current OpenMP specification (discussed later).
The problem size is class A.

The benchmarks were run on one node of the Hitachi SR8000, and the serial
execution time and the parallel execution time on 8 processors were
measured.



5.2 Performance Results 

Figure 5 shows the performance results.

[Figure 5: speedup plotted by benchmark name]

Fig. 5. Speedup of NPB2.3-serial



The figure shows that the speedup on 8 processors is about 5.3 to 8.0 for
the 6 benchmarks other than LU. The speedup of LU is low because it
contains a wavefront-style loop nest, shown below, in which every loop has
a dependence across loop iterations, so the loop is not parallelized.



      do j=jst,jend
      do i=ist,iend
      do m=1,5
         v(m,i,j,k) = v(m,i,j,k)
     >      - omega*(ldy(m,1,i,j)*v(1,i,j-1,k)
     >              +ldx(m,1,i,j)*v(1,i-1,j,k)
            ...
      enddo
      enddo
      enddo

Because the DO directive of OpenMP asserts that the loop has no dependence
across loop iterations, this kind of loop cannot be parallelized easily
with OpenMP.

The automatic parallelization of our compiler can parallelize such a
wavefront-style loop. We parallelized the subroutine containing this loop
by automatic parallelization, parallelized the other subroutines with
OpenMP, and measured the performance. The result on 8 processors is a
speedup of 6.3.
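
One well-known way to expose parallelism in a wavefront-style loop is to traverse the iteration space along diagonals, since elements on the same diagonal do not depend on each other. The C sketch below shows this on a simplified two-dimensional loop with the same dependence pattern as the LU loop. It is an illustration only: the function names and the simplified update are assumptions, and the paper does not say which transformation the Hitachi automatic parallelizer applies.

```c
#include <assert.h>

/* Sequential reference: each element depends on its (i-1,j) and (i,j-1)
 * neighbours, so neither the i loop nor the j loop is independent. */
static void sweep_serial(double *v, int ni, int nj, double omega)
{
    for (int j = 1; j < nj; j++)
        for (int i = 1; i < ni; i++)
            v[j * ni + i] -= omega * (v[(j - 1) * ni + i] + v[j * ni + i - 1]);
}

/* Wavefront order: all elements with the same d = i + j are independent,
 * so the inner loop may be run as a parallel loop. */
static void sweep_wavefront(double *v, int ni, int nj, double omega)
{
    assert(ni > 0 && nj > 0);
    for (int d = 2; d <= (ni - 1) + (nj - 1); d++) {
        /* keep both i and j = d - i inside [1, ni-1] and [1, nj-1] */
        int ilo = d - (nj - 1) > 1 ? d - (nj - 1) : 1;
        int ihi = d - 1 < ni - 1 ? d - 1 : ni - 1;
        #pragma omp parallel for
        for (int i = ilo; i <= ihi; i++) {
            int j = d - i;
            v[j * ni + i] -= omega * (v[(j - 1) * ni + i] + v[j * ni + i - 1]);
        }
    }
}
```

Both functions perform the same updates on the same operands, only in a different order, so they produce the same array. The diagonals near the corners are short, which limits the available parallelism at the start and end of the sweep.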



5.3 Some Problems of OpenMP Specification 

When we parallelized the NPB and other programs with OpenMP, we met cases
in which a program could not be parallelized, or had to be rewritten to
enable parallelization, because some functions do not exist in OpenMP. We
describe some of these cases below.

1. Parallelization of a wavefront style loop

As mentioned in section 5.2, a wavefront-style loop in the LU benchmark of
NPB could not be parallelized. This is because OpenMP has only directives
which assert that the loop has no dependence across loop iterations.

2. Array reduction

In the EP benchmark of NPB, parallel reduction for an array is needed.
However, in the OpenMP 1.0 specification, only scalar variables can be
specified as REDUCTION variables; arrays cannot. This forces the source
code to be rewritten.

3. Parallelization of a loop containing an induction variable

It is often necessary to parallelize a loop that involves an induction
variable, as follows.

K=...

do I=1,N
A(2*I-1)=K
A(2*I)=K+3
K=K+6
enddo

However, this loop also cannot be parallelized with OpenMP directives
alone; the source code must be modified.
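
As an illustration of the kind of source modification meant in item 3, the C sketch below rewrites the induction variable as a closed-form function of the loop index, which removes the loop-carried dependence. The function names are hypothetical, the indices are 0-based rather than 1-based as in the Fortran fragment, and this is one common manual rewrite rather than the authors' own code.

```c
#include <assert.h>

/* Original form: k is carried from one iteration to the next, so the
 * loop cannot simply be marked as a parallel loop. */
static void fill_serial(int *a, int n, int k0)
{
    int k = k0;
    for (int i = 0; i < n; i++) {
        a[2 * i] = k;
        a[2 * i + 1] = k + 3;
        k += 6;
    }
}

/* Rewritten form: k is computed from i, so iterations are independent. */
static void fill_closed_form(int *a, int n, int k0)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int k = k0 + 6 * i;   /* closed form of the induction variable */
        a[2 * i] = k;
        a[2 * i + 1] = k + 3;
    }
}
```

The rewrite is mechanical here because the increment is a constant; the paper's point is that the programmer has to do it by hand because no OpenMP directive expresses it.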

Because the kinds of parallelization described above often appear when
parallelizing real programs, it is desirable to extend OpenMP so that
these loops can be parallelized by OpenMP directives alone.



6 Conclusions

In this paper, we described the implementation and evaluation of the OpenMP
compiler for parallelization within a node of the Hitachi SR8000. The
compiler implements the fork-join execution model of OpenMP using the
hardware mechanisms of the SR8000 and achieves highly efficient parallel
execution. The compiler can also perform optimization across the boundary
of a PARALLEL region and, with the -procnum=8 option, generate code
optimized for 8 processors. Furthermore, for the user’s convenience, we
implemented diagnostic messages that support the user’s parallelization,
and enabled OpenMP to be combined with automatic parallelization and
Hitachi proprietary directives. We evaluated the compiler by parallelizing
the NAS Parallel Benchmarks with OpenMP and achieved a speedup of about 5.3
to 8.0 on 8 processors.

Through this evaluation of OpenMP, we found several loops that cannot be
parallelized because OpenMP lacks the necessary features. These loops can
be parallelized by the automatic parallelization or the Hitachi proprietary
directives provided by our compiler. We intend to develop extensions to the
OpenMP specification so that these loops can be parallelized by directives
alone.



Abstract. We developed an OpenMP compiler, called Omni. This paper
describes a performance evaluation of the Omni OpenMP compiler. We take two
commercial OpenMP C compilers, the KAI GuideC and the PGI C compiler, for
comparison. Microbenchmarks and a program in Parkbench are used for the
evaluation. The results using a SUN Enterprise 450 with four processors
show that the performance of Omni is comparable to that of a commercial
OpenMP compiler, KAI GuideC. According to the results, parallelization
using OpenMP directives is effective and scales well if the loop contains
enough operations.

Keywords: OpenMP, compiler, Microbenchmarks, Parkbench, performance
evaluation



1 Introduction 

Multi-processor workstations and PCs are getting popular, and are being used 
as parallel computing platforms in various types of applications. Since porting 
applications to parallel computing platforms is still a challenging and time con- 
suming task, it would be ideal if it could be automated by using some paralleliz- 
ing compilers and tools. However, automatic parallelization is still a challenging 
research topic and is not yet at the stage where it can be put to practical use. 

OpenMP [1], which is a collection of compiler directives, library routines,
and environment variables, has been proposed as a standard interface for
parallelizing sequential programs. The OpenMP language specification came
out in 1997 for Fortran, and in 1998 for C/C++. Recently, compiler vendors
for PCs and workstations have endorsed the OpenMP API and have released
commercial compilers that are able to compile an OpenMP parallel program.

There have been several efforts to make a standard for compiler directives,
such as OpenMP and HPF [12]. OpenMP aims to provide portable compiler
directives for shared memory programming. On the other hand, HPF was
designed to provide data parallel programming for distributed or
non-uniform memory access systems. These specifications were originally
supported only in Fortran, but OpenMP announced specifications for C and
C++. In OpenMP and HPF, the directives specify parallel actions explicitly
rather than as hints for parallelization.

While high performance computing programs, especially for scientific
computing, are often written in Fortran, many programs are written in C in
workstation environments. We focus on OpenMP C compilers in this paper. We
report our evaluation of the Omni OpenMP compiler [4] and compare Omni with
commercial OpenMP C compilers. The objectives of our experiment are to
evaluate available OpenMP compilers, including our Omni OpenMP compiler,
and to examine the performance improvement gained by using the OpenMP
programming model.

The remainder of this paper is organized as follows: Section 2 presents the 
overview of the Omni OpenMP compiler and its components. The platforms and 
the compilers we tested for our experiment are described in section 3. Section 4 
introduces Microbenchmarks, an OpenMP benchmark program developed at the 
University of Edinburgh, and shows the results of an evaluation using it. Section 
5 presents a further evaluation using another benchmark program, Parkbench. 
Section 6 describes related work and we conclude in section 7. 

2 The Omni OpenMP Compiler 

We are developing an experimental OpenMP compiler, Omni [4] , for an SMP 
machine. An overview of the Omni OpenMP compiler is presented in this section. 

The Omni OpenMP compiler is a translator which takes OpenMP programs as
input and generates multi-threaded C programs with run-time library calls.
The resulting programs are compiled by a native C compiler, and then linked
with the Omni run-time library to execute in parallel. Omni uses the POSIX
thread library for parallel execution, which makes it easy to port Omni to
other platforms. Omni has already been ported to Solaris on SPARC and on
Intel, Linux on Intel, IRIX and AIX.

The Omni OpenMP compiler consists of three parts, a front-end, the Exc 
Java tool and a run-time library. Figure 1 illustrates the structure of Omni. 

The Omni front-end accepts programs parallelized using OpenMP directives
that are specified in the OpenMP application program interface [2] [3].
Front-ends for C and FORTRAN77 are available now, and a C++ version is
under development. The input program is parsed into an Omni intermediate
code, called Xobject code, for both C and FORTRAN77.

The next part, the Exc Java tool, is a Java class library that provides
classes and methods to analyze and transform the Xobject intermediate code.
It also generates a parallelized C program from the Xobject. The
representation of Xobject code which is manipulated by the Exc Java tool is
a kind of Abstract Syntax Tree (AST) with data type information. Each node
of the AST is a Java object that represents a syntactical element of the
source code that can be easily transformed. The Exc Java tool encapsulates
the parallel execution part into a separate function to translate a
sequential program with OpenMP directives into a fork-join parallel
program.

[Figure 1 labels: inputs "F77 + OpenMP", "C + OpenMP", "C++ + OpenMP";
"Omni OpenMP compiler"; output "C + runtime library"]

Fig. 1. Omni OpenMP compiler



Figures 2 and 3 show the input OpenMP code fragment and the parallelized
code which is translated by Omni, respectively. A master thread calls the
Omni run-time library function, _ompc_do_parallel, to invoke slave threads
which execute the function in parallel. Pointers to shared variables with
auto storage classes are copied into a shared memory heap and passed to
slaves at the fork. Private variables are redeclared in the functions
generated by the compiler. The work sharing and synchronization constructs
are translated into code which contains the corresponding run-time library
calls.

func(){
    ...
#pragma omp parallel for
    for(...){
        x=y...
    }
}

Fig. 2. OpenMP program fragment

void ompc_func_6(void **ompc_args)
{
    auto double **_pp_x;
    auto double **_pp_y;
    _pp_x = (double **)*(ompc_args+0);
    _pp_y = (double **)*(ompc_args+1);
    {
        /* index calculation */
        for(...){
            __p_x=__p_y...
        }
    }
}

func(){
    ...
    {/* #pragma omp parallel for */
        auto void *ompc_argv[2];
        *(ompc_argv+0) = (void *)&x;
        *(ompc_argv+1) = (void *)&y;
        _ompc_do_parallel(ompc_func_6, ompc_argv);
    }
}

Fig. 3. Program parallelized using Omni

The Omni run-time library contains library functions used in the translated
program, for example, _ompc_do_parallel in Figure 3, and libraries that are
specified in the OpenMP API. For parallel execution, the POSIX thread
library or the Solaris thread library on Solaris OS can be used, according
to the Omni compilation option. The Omni compilation option also allows use
of the mutex_lock function instead of the spin-wait lock we developed,
which is the default lock function in Omni. The 1-read/n-write busy-wait
algorithm [13] is used as the default Omni barrier function.

Threads are allocated at the beginning of an application program in Omni, 
not at every parallel execution part contained in the program. All threads but the 
master are waiting in a conditional wait state until the start of parallel execution, 
triggered by the library call described before. The allocation and deallocation of 
these threads are managed by using a free list in the run-time library. The list 
operations are executed exclusively using the system lock function. 



3 Platforms and OpenMP Compilers 

The following machines were used as platforms for our experiment. 

- SUN Enterprise 450 (UltraSPARC 300MHz x4), Solaris 2.6, SUNWspro 4.2 C
compiler, JDK1.2

- COMPaS-II (COMPAQ ProLiant6500, Pentium-II Xeon 450MHz x4), RedHat
Linux 6.0 + kernel-2.2.12, gcc-2.91.66, JDK1.1.7

We evaluated commercial OpenMP C compilers as well as the Omni OpenMP
compiler. The commercial OpenMP C compilers we tested are:

- KAI GuideC Ver.3.8 [10] on the SUN, and
- PGI C compiler pgcc 3.1-2 [11] on the COMPaS-II.

KAI GuideC is a preprocessor that translates OpenMP programs into
parallelized C programs with library calls. On the other hand, the PGI C
compiler translates an input program directly to the executable code. The
compile options used in the following tests are '-fast' for the SUN C
compiler, '-O3 -malign-double' for the GNU gcc, and '-mp -fast' for the PGI
C compiler.

4 Performance Overhead of OpenMP 

This section presents the evaluation of the performance overhead of OpenMP 
compilers using Microbenchmarks. 

4.1 Microbenchmarks

Microbenchmarks [6], developed at the University of Edinburgh, is intended to 
measure the overheads of synchronization and loop scheduling in the OpenMP 
runtime library. The benchmark measures the performance overhead incurred 
by the OpenMP directives, for example ’parallel’, ’for’ and ’barrier’, and the 
overheads of the parallel loop using different scheduling options and chunk sizes. 

4.2 Results on the SUN System 

Figure 4 shows the results of using the Omni OpenMP compiler and KAI GuideC.
The native C compiler used for both OpenMP compilers is the SUNWspro 4.2
C compiler with the '-fast' optimization option.



[Figure 4: two panels (Omni, KAI) plotting overhead time (usec) for the
directives "parallel", "for", "parallel-for", "barrier", "single",
"critical", "lock-unlock", "ordered", "atomic" and "reduction"]

Fig. 4. Overhead of Omni(left) and KAI(right)






Kazuhiro Kusano et al. 



These results show the Omni OpenMP compiler achieves competitive perfor- 
mance when compared to the commercial KAI GuideC OpenMP compiler. The 
overhead of ’parallel’, ’parallel-for’ and ’parallel-reduction’ is bigger than that 
of other directives. This indicates that it is important to reduce the number of 
parallel regions to achieve good parallel performance. 

4.3 Results on the COMPaS-II System 

The results of using the Omni OpenMP compiler and the PGI C compiler on
the COMPaS-II are shown in Figure 5.

Fig. 5. Overhead of Omni(left) and PGI(right)



The PGI compiler shows very good performance, especially for ’parallel’, 
’parallel-for’ and ’parallel-reduction.’ The overhead of Omni for those directives 
increases almost linearly. Although the overhead of Omni for those directives is 
twice that of PGI, it is reasonable when compared to the results on the SUN. 

4.4 Breakdown of the Omni Overhead 

The 'parallel', 'parallel-for' and 'parallel-reduction' directives
originally scaled poorly on Omni. We ran some experiments to break down the
overhead of the 'parallel' directive and found that the data structure
operations used to manage parallel execution and synchronization in the
Omni run-time library accounted for most of the overhead.

The threads are allocated once, in the initialization phase of program
execution, and after that idle threads are managed by the run-time library
using an idle queue. This queue has to be operated on exclusively, which
serializes the queue operations. In addition to the queue operations, there
was a redundant barrier synchronization at the end of the parallel region
in the library. We modified the run-time library to reduce the number of
library calls which require exclusive operation and to eliminate the
redundant synchronization. As a result, the performance shown in Figures 4
and 5 was achieved. The overhead of 'parallel for' on the COMPaS-II is
still unreasonably large, and its cause has not yet been resolved.

Table 1 shows the time spent on the allocation of threads, and on the
release of threads plus barrier synchronization, on the COMPaS-II system.



Table 1. Time to allocate/release data (usec (%))

PE                  1          2          3          4
allocation          0.40(43)   2.7(67)    3.5(65)    4.0(63)
release + barrier   0.29(31)   0.50(12)   0.56(10)   0.60(9)



This shows that thread allocation still accounts for the largest part of
the overhead.

5 Performance Improvement from Using OpenMP 
Directives 

This section describes the performance improvements using the OpenMP direc- 
tives. 

We take a benchmark program from Parkbench to use in our evaluation. The 
performance improvements of a few simple loops with the iterations ranging from 
one to 100,000 show the efficiency of the OpenMP programming model. 



5.1 Parkbench 

Parkbench [8] is a set of benchmark programs designed to measure the perfor- 
mance of parallel machines. Its parallel execution model is message passing using 
PVM or MPI. It consists of low-level benchmarks, kernel benchmarks, compact 
applications and HPF benchmarks. 

We use one of the programs, rinf1, in the low-level benchmarks to carry out
our experiment. The low-level benchmark programs are intended to measure
the performance of a single processor. We rewrote the rinf1 program in C,
because the original was written in Fortran. The rinf1 program runs a set
of common Fortran operation loops at different loop lengths. For the
following test, we chose kernel loops 3, 6 and 16. Figure 6 shows code
fragments from the rinf1 program.

5.2 Results on the SUN System 

for( jt = 0 ; jt < ntim ; jt++ ){
    dummy(jt);
#pragma omp parallel for
    for( i = 0 ; i < n ; i++ ) /* kernel 3 */
        a[i] = b[i] * c[i] + d[i];
}

#pragma omp parallel for
    for( i = 0 ; i < n ; i++ ) /* kernel 6 */
        a[i] = b[i] * c[i] + d[i] * e[i] + f[i];

Fig. 6. rinf1 kernel loop

Figures 7, 8 and 9 show the results of kernel loops 3, 6 and 16,
respectively, in the rinf1 benchmark program, which was parallelized using
OpenMP directives and executed on the SUN machine. In these graphs, the
x-axis is loop length, and the y-axis represents performance in Mflops.

Fig. 7. kernel 3[a(i)=b(i)*c(i)+d(i)] on the SUN: Omni(L) and KAI(R)
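
For reference, the Mflops values on the y-axis of these graphs can be derived from the loop length, the repetition count and the measured time. The C sketch below shows the arithmetic. The function name mflops and the enum constants are hypothetical, and the per-iteration operation counts are taken by counting the multiplications and additions in the kernel expressions given in the figure captions; the benchmark's own accounting may differ.

```c
#include <assert.h>

/* Floating-point operations per iteration, counted from the kernel
 * expressions: kernel 3 is b*c+d, kernel 6 is b*c+d*e+f, kernel 16 is
 * s*b+c. These counts are an assumption of this sketch. */
enum { FLOPS_KERNEL3 = 2, FLOPS_KERNEL6 = 4, FLOPS_KERNEL16 = 2 };

/* Performance in Mflops for a kernel of loop length n that was repeated
 * ntim times and took `seconds` of wall-clock time in total. */
static double mflops(int flops_per_iter, long n, long ntim, double seconds)
{
    assert(seconds > 0.0);
    return (double)flops_per_iter * (double)n * (double)ntim
           / seconds / 1.0e6;
}
```

For example, kernel 3 with a loop length of 1000 repeated 1000 times in 0.01 seconds corresponds to about 200 Mflops. For short loops the fixed cost of entering the parallel region dominates the measured time, which is why the curves rise with loop length.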

Both OpenMP compilers, Omni and KAI GuideC, achieve almost the same 
performance improvement, though there are some differences. The differences 
resulted mainly from the run-time library, because both OpenMP compilers 
translate to the C program with run-time library calls. KAI GuideC shows better 
performance for short loop lengths of kernel 6 on one processor, and the peak 
performance for kernel 16 on two and four processors is better than that of Omni. 

5.3 Results on the COMPaS-II System 

Figures 10, 11 and 12 show the results of the kernel loops in the rinf1
benchmark program, parallelized using OpenMP directives and executed on the
COMPaS-II. As in the previous case, the x-axis represents loop length and
the y-axis represents performance in Mflops.

The results show that the PGI compiler achieves better performance than the
Omni OpenMP compiler on the COMPaS-II. The PGI compiler achieves very good
performance for short loop lengths on one processor. The peak performance
of PGI reaches about 400 Mflops or more on four processors, and it is
nearly double that of Omni in kernels 3 and 16.

Fig. 8. kernel 6[a(i)=b(i)*c(i)+d(i)*e(i)+f(i)] on the SUN: Omni(L) and KAI(R)

Fig. 9. kernel 16[a(i)=s*b(i)+c(i)] on the SUN: Omni(L) and KAI(R)



5.4 Discussion 

Omni and KAI GuideC achieve almost the same performance improvement on the
SUN, but the points described above must be kept in mind. The performance
improvement of the PGI compiler on the COMPaS-II has different
characteristics from the others. In particular, PGI achieves higher
performance than Omni for short loop lengths on one processor, and its peak
performance is nearly double for kernels 3 and 16. This indicates that the
performance of Omni on the COMPaS-II could be improved by optimizing the
Omni run-time library, though one must consider the fact that the backend
of Omni is different.

Those results show that parallelization using the OpenMP directives is
effective and that performance scales up even for tiny loops if the loop
length is long enough.

Fig. 10. kernel 3[a(i)=b(i)*c(i)+d(i)] on the COMPaS-II: Omni(L) and PGI(R)

Fig. 11. kernel 6[a(i)=b(i)*c(i)+d(i)*e(i)+f(i)] on the COMPaS-II: Omni(L) and
PGI(R)

6 Related Work 

Lund University in Sweden developed a free OpenMP C compiler, called
OdinMP/CCp [5]. It is also a translator to a multi-threaded C program and
uses Java as its development language, the same as our Omni. The difference
is found in the input language. OdinMP/CCp only supports C as input, while
Omni supports C and FORTRAN77. The development language of each front-end
is also different: C in Omni and Java in OdinMP/CCp.

There are many projects related to OpenMP, for example, research on
executing an OpenMP program on top of a Distributed Shared Memory (DSM)
environment on a network of workstations [7], and the investigation of a
parallel programming model based on MPI and OpenMP to utilize the memory
hierarchy of an SMP cluster [9]. Several projects, including the OpenMP
ARB, have stated the intention to develop an OpenMP benchmark program,
though Microbenchmarks [6] is the only one available now.

Fig. 12. kernel 16[a(i)=s*b(i)+c(i)] on the COMPaS-II: Omni(L) and PGI(R)

7 Conclusions 

This paper presented an overview of the Omni OpenMP compiler and an evalua- 
tion of its performance. The Omni consists of a front-end, an Exc Java tool, and 
a run-time library, and translates an input OpenMP program to a parallelized 
C program with run-time library calls. We chose Microbenchmarks and a pro- 
gram in Parkbench to use for our evaluation. While Microbenchmarks measures 
the performance overhead of each OpenMP construct, the Parkbench program 
evaluates the performance of array calculation loop parallelized by using the 
OpenMP programming model. The latter gives some criteria to use to paral- 
lelize a program using OpenMP directives. 

Our evaluation, using benchmark programs, shows Omni achieves compa- 
rable performance to a commercial OpenMP compiler, KAI GuideC, on a SUN 
system with four processors. It also reveals a problem with the Omni run-time li- 
brary which indicates that the overhead of thread management data is increased 
according to the number of processors. 

On the other hand, the PGI compiler is faster than Omni on the COMPaS-II
system, which indicates that optimizing the Omni run-time library could
improve its performance, though one must consider the fact that the backend
of Omni is different.

The evaluation also shows that parallelization using the OpenMP directives
is effective and that performance scales up even for tiny loops if the loop
length is long enough, while the COMPaS-II requires very careful
optimization to get peak performance.



Abstract. This paper describes transparent mechanisms for emulating
some of the data distribution facilities offered by traditional data-parallel 
programming models, such as High Performance Fortran, in OpenMP. 
The vehicle for implementing these facilities in OpenMP without mod- 
ifying the programming model or exporting data distribution details to 
the programmer is user-level dynamic page migration [9,10]. We have im- 
plemented a runtime system called UPMlib, which allows the compiler to 
inject into the application a smart user-level page migration engine. The 
page migration engine improves transparently the locality of memory ref- 
erences at the page level on behalf of the application. This engine can ac- 
curately and timely establish effective initial page placement schemes for 
OpenMP programs. Furthermore, it incorporates mechanisms for tuning 
page placement across phase changes in the application communication 
pattern. The effectiveness of page migration in these cases depends heav- 
ily on the overhead of page movements, the duration of phases in the 
application code and architectural characteristics. In general, dynamic 
page migration between phases is effective if the duration of a phase is 
long enough to amortize the cost of page movements.

1 Introduction

One of the most important problems facing programming models based on the
shared-memory communication abstraction on distributed shared-memory
multiprocessors is poor data locality [3,4]. The non-uniform memory access
latency of scalable shared-memory multiprocessors necessitates the
alignment of threads and data of a parallel program, so that the rate of
remote memory accesses is minimized. Plain shared-memory programming models
hide the details of data distribution from the programmer and rely on the
operating system for laying out the data in a locality-aware manner.
Although this approach contributes to the simplicity of the programming
model, it also jeopardizes performance if the page placement strategy
employed by the operating system does not match the memory reference
pattern of the application. Increasing the rate of remote memory accesses
implies an increase of memory latency by a factor of three to five and may
easily become the main bottleneck towards performance scaling.

OpenMP has become the de-facto standard for programming shared-memory 
multiprocessors and is already widely adopted in industry and academia
as a simple and portable parallel programming interface [11]. Unfortunately, in 
several case studies with industrial codes OpenMP has exhibited performance 
inferior to that of message-passing and data parallel paradigms such as MPI and 
HPF, primarily due to the inability of the programming model to control data 
distribution [1,12]. OpenMP provides no means to the programmer for distribut- 
ing data among processors. Although automatic page placement schemes at the 
operating system level, such as first-touch and round-robin, are often sufficient 
for achieving acceptable data locality, explicit placement of data is frequently 
needed to sustain efficiency on large-scale systems [4]. 

The natural means to surmount the problem of data placement on distributed
shared-memory multiprocessors is data distribution directives [2]. Indeed,
vendors of scalable shared-memory systems are already providing programmers
with platform-specific data distribution facilities, and the introduction
of such facilities in the OpenMP programming interface has been proposed by
several vendors. However, offering data distribution directives similar to
the ones offered by High Performance Fortran (HPF) [7] in shared-memory
programming models has two fundamental shortcomings. First, data
distribution directives are inherently platform-dependent and thus hard to
standardize and incorporate seamlessly in shared-memory programming models
like OpenMP, which seeks portable parallel programming across a wide range
of architectures. Second, data distribution is subtle for programmers and
compromises the simplicity of OpenMP. The OpenMP programming model is
designed to enable straightforward parallelization of sequential codes,
without exporting architectural details to the programmer. Data
distribution contradicts this design goal.

Dynamic page migration [14] is an operating system mechanism for tuning
page placement on distributed shared memory multiprocessors, based on the
observed memory reference traces of each program at runtime. The operating
system uses per-node, per-page hardware counters to identify the node of
the system that references each page in memory most frequently. If this
node is not the node that hosts the page, the operating system applies a
competitive criterion and migrates the page to the most frequently
referencing node, provided the page migration does not violate a set of
resource management constraints. Although dynamic page migration was
proposed merely as an optimization for parallel programs with dynamically
changing memory reference patterns, it was recently shown that a smart page
migration engine can also be used as a means for achieving good data
placement in OpenMP without exporting architectural details to the
programmer [9,10].
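
The decision described in this paragraph can be sketched in C as follows. This is a hypothetical illustration, not the operating system's or UPMlib's actual algorithm: the function name migration_target, the layout of the counter array, and the specific form of the competitive criterion (the best remote node must have more than `threshold` times as many references as the home node) are all assumptions, and the resource management constraints are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the node a page should migrate to, or `home` if it should stay.
 * counts[i] is the number of references node i made to the page, as a
 * per-node, per-page counter would report it. `threshold` is a
 * hypothetical tuning parameter of the competitive criterion. */
static int migration_target(const unsigned long *counts, int nnodes,
                            int home, unsigned long threshold)
{
    assert(counts != NULL && nnodes > 0 && home >= 0 && home < nnodes);

    /* find the node that references the page most frequently */
    int best = home;
    for (int i = 0; i < nnodes; i++)
        if (counts[i] > counts[best])
            best = i;

    /* migrate only if the remote node clearly dominates the home node */
    if (best != home && counts[best] > threshold * counts[home])
        return best;
    return home;
}
```

A threshold larger than 1 keeps a page from moving back and forth between two nodes that reference it about equally often, which matters because each migration has a cost that must be recovered by later local accesses.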

In this paper we present an integrated compiler/runtime/OS page migration
framework, which emulates data distribution and redistribution in OpenMP
without modifying the OpenMP application programming interface. The key to
leveraging page migration as a data placement engine is the integration of
the compiler in the page migration mechanism. The compiler can provide useful
information on three critical factors that determine data locality: the areas
of the address space of the program which are likely to concentrate remote
memory accesses, the structure of the program, and the phase changes in the
memory reference pattern. This information can be exploited to trigger a page
migration mechanism at the points of execution at which data distribution or
redistribution would be theoretically needed to reach good levels of data
locality.

We show that simple mechanisms for page migration can be more than
sufficient for achieving the same level of performance that an optimal
initial data distribution scheme achieves. Furthermore, we show that dynamic
page migration can be used for phase-driven optimization of data placement,
under the constraint that the computational granularity of phases is coarse
enough to enable the migration engine to balance the high cost of coherent
page movements against the gains from reducing the number of remote memory
accesses. The presented mechanisms are implemented entirely at user level in
UPMlib (User-level Page Migration library), a runtime system designed to
transparently tune the memory performance of OpenMP programs on the SGI
Origin2000. For details on the implementation and the page migration
algorithms of UPMlib the reader is referred to [9,10]. This paper emphasizes
the mechanisms implemented in UPMlib to emulate data distribution.

The rest of this paper is organized as follows. Section 2 shows how user-level
page migration can be used to emulate data distribution and redistribution in
OpenMP. Section 3 provides a set of experimental results that substantiate the
argument that dynamic page migration can serve as an effective substitute for
page distribution and redistribution in OpenMP. Section 4 concludes the paper.

2 Implementing Transparent Data Distribution at 
User-Level 

This section presents mechanisms for emulating data distribution and
redistribution in OpenMP programs without programmer intervention, by
leveraging dynamic page migration at user level.

2.1 Initial Data Distribution

In order to approximate an effective initial data distribution scheme, a page mi- 
gration engine must be able to identify early in the execution of the program 
the node in which each page should be placed according to the expected mem- 
ory reference trace of the program. Our user-level page migration engine uses 
two mechanisms for this purpose. The first mechanism is designed for iterative 
programs, i.e. programs that enclose the complete parallel computation in an 
outer sequential loop and repeat exactly the same computation for a number 
of iterations, typically corresponding to time steps. This class of programs rep- 
resents the vast majority of parallel codes. The second mechanism is designed 
for non-iterative programs and programs which are iterative but do not repeat 
the same reference trace in every iteration. Both mechanisms operate on ranges 
of the virtual address space of the program which are identified as hot memory 
areas by the OpenMP compiler. In the current setting, hot areas are the shared 
arrays which are both read and written in possibly disjoint OpenMP PARALLEL 
DO and PARALLEL SECTIONS constructs. 

The iterative mechanism is activated by having the OpenMP compiler in- 
strument the programs to invoke the page migration engine at the end of every 
outer iteration of the computation. At these points, the page migration engine 
obtains an accurate snapshot of the complete page reference trace of the program 
after the execution of the first iteration. Since the recorded reference pattern will 
repeat itself throughout the lifetime of the program, the page migration engine 
can use it to place any given page in an optimal manner, so that the maximum 
latency due to remote accesses by any node to this page is minimized.
Snapshots from more than one iteration are needed when some pages ping-pong
between two or more nodes due to page-level false sharing. This problem can
be solved easily in the first few iterations of the program by freezing the
pages that tend to bounce between nodes [9].

The iterative mechanism makes very accurate page migration decisions and
amortizes the cost of page migrations well, since all the page movement
activity is concentrated in the first iteration of the parallel program. This
is also the reason that this mechanism is an effective alternative to an
initial data distribution scheme. UPMlib actually deactivates the mechanism
after detecting that page placement has stabilized and no further page
migrations are needed to reduce the rate of remote memory accesses. Figure 1
gives an example of the usage of the iterative page migration mechanism in the
NAS BT benchmark. In this example, u, rhs and forcing are identified as hot
memory areas by the compiler and monitoring of page references is activated on
these areas via the upmlib_memrefcnt() call to the runtime system. The
function upmlib_migrate_memory() applies a competitive page migration
criterion on all pages in the hot memory areas and moves the pages that
satisfy this criterion [9].

      call upmlib_init()
      call upmlib_memrefcnt(u, size)
      call upmlib_memrefcnt(rhs, size)
      call upmlib_memrefcnt(forcing, size)

      do step = 1, niter
         call compute_rhs
         call x_solve
         call y_solve
         call z_solve
         call add
         call upmlib_migrate_memory()
      enddo

Fig. 1. Using the iterative page migration mechanism of UPMlib in NAS BT

In cases in which the complete memory reference pattern of a program cannot
be accurately identified, UPMlib uses a mechanism which periodically samples
the memory reference counters of a number of pages and migrates the pages that
appear to concentrate excessive remote memory accesses. The sampling-based
page migration mechanism is implemented with a memory management thread that
wakes up periodically and scans a fraction of the pages in the hot memory
areas to detect candidate pages for migration. The length of the sampling
interval and the number of pages scanned upon each invocation are tunable
parameters. Due to the cost of page migrations, the duration of the sampling
interval must be at least a few hundred milliseconds, in order to provide the
page migration engine with a reasonable time frame for migrating pages and
moving the cost of some remote accesses off the critical path.
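The scanning step of such a sampling thread might look like the Python sketch below. It is not UPMlib code: the class name, the wrap-around cursor and the is_candidate callback (standing in for the counter-based test) are assumptions, and the periodic timer is omitted so that the example stays deterministic.

```python
from typing import Callable, List


class SamplingScanner:
    """Scans a fixed-size slice of the hot pages on every wakeup."""

    def __init__(self, hot_pages: List[int], pages_per_wakeup: int):
        if pages_per_wakeup <= 0:
            raise ValueError("pages_per_wakeup must be positive")
        self.hot_pages = list(hot_pages)
        self.pages_per_wakeup = pages_per_wakeup
        self.cursor = 0  # index of the next page to scan

    def wakeup(self, is_candidate: Callable[[int], bool]) -> List[int]:
        """Scan the next slice and return the pages that should migrate."""
        total = len(self.hot_pages)
        count = min(self.pages_per_wakeup, total)
        scanned = [self.hot_pages[(self.cursor + i) % total]
                   for i in range(count)]
        if total:
            # Advance the cursor, wrapping around the hot page list.
            self.cursor = (self.cursor + count) % total
        return [page for page in scanned if is_candidate(page)]
```

In a real runtime, wakeup() would be called by the memory management thread once per sampling interval, and pages_per_wakeup would correspond to the tunable number of pages scanned per invocation.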

The effectiveness of the sampling mechanism depends heavily on the
characteristics of the temporal locality of the program at the page level.
The coarser the temporal locality, the more effective the sampling mechanism.
Assume that the cost of a page migration is 1 ms (a typical value for
state-of-the-art systems) and that a program has a resident set of 3000 pages
(a typical value for popular benchmarks like NAS). In a worst-case scenario in
which all the pages are misplaced, the page migration engine needs 3 seconds
to fix the page placement, if the memory access pattern of the program remains
uniform while pages are being moved by the runtime system. Clearly, if the
program has an execution time or phases shorter than 3 seconds, there is not
enough time for the page migration engine to move the misplaced pages. The
sampling mechanism is therefore expected to behave robustly for programs with
reasonably long execution times or reasonably small resident sets with respect
to the cost of coherent page migration by the operating system.
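The worst-case estimate above is a simple product, which the following Python sketch reproduces. The function names are made up for illustration, and the model assumes that migrations are serialized and that every resident page is misplaced.

```python
def worst_case_fix_time(resident_pages: int, migration_cost_ms: float) -> float:
    """Seconds needed to migrate every resident page once."""
    return resident_pages * migration_cost_ms / 1000.0


def sampling_can_converge(phase_seconds: float, resident_pages: int,
                          migration_cost_ms: float) -> bool:
    """True if a phase lasts longer than the worst-case repair time."""
    return phase_seconds > worst_case_fix_time(resident_pages,
                                               migration_cost_ms)
```

For the values quoted in the text (3000 pages, 1 ms per migration) the first function returns 3.0 seconds, so a phase of 2 seconds is too short while a run of 90 seconds leaves ample room.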

2.2 Data Redistribution 

Data redistribution in data parallel programming models such as HPF requires
identification of phase changes in the reference pattern of the programs. In
analogy to HPF, a phase in OpenMP can be defined as a sequence of basic blocks
in which the program has a uniform communication pattern among processors.
Each phase may encapsulate more than one OpenMP PARALLEL construct. Under
this simple definition the OpenMP compiler can use the page migration engine
to implicitly establish an appropriate page placement scheme before the
beginning of each phase. The hard problem that has to be addressed in this
case is how the page migration engine can identify a good page placement for
each phase in the program, using only implicit memory reference information
available from the hardware counters.

UPMlib uses a mechanism called record/replay to address the aforementioned
problem. This mechanism is conceptually similar to the record/replay barriers
described in [6]. The record/replay mechanism is effective for strictly
iterative parallel codes in which the same memory reference trace is repeated
for a number of iterations. For non-iterative programs, or iterative programs
with non-repetitive access patterns, UPMlib employs the sampling mechanism
outlined in Section 2.1. The record/replay mechanism is activated as follows. The
compiler instruments the OpenMP program to record the page reference coun- 
ters at all phase transition points. The recording procedure stores two sets of 
reference traces per phase, one at the beginning of the phase and one before the 
transition to the next phase. The recording mechanism is activated only during 
the first iteration. UPMlib estimates the memory reference trace of each phase 
by comparing the two sets of counters that were recorded at the phase bound- 
aries. The runtime system identifies the pages that should move in order to tune 
page placement before the transition to a phase, by applying the competitive 
criterion to all pages accessed during the phase, based on the corresponding ref- 
erence trace. After the last phase of the first iteration, the program can simply 
undo the page migrations executed at all phase transition points by sending the 
pages back to their original homes. This action recovers the initial page place- 
ment scheme. In subsequent iterations, the runtime system replays the recorded 
page migrations at the respective phase transition points. 
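The record, replay and undo steps can be sketched in Python as follows. This is a simplified model rather than the UPMlib implementation: the data layout (page to node to count dictionaries), the function names and the threshold form of the competitive criterion are assumptions, and pages are "moved" by updating a home-node table.

```python
from typing import Dict, List, Tuple

Counters = Dict[int, Dict[int, int]]  # page -> node -> reference count
Move = Tuple[int, int, int]           # (page, from_node, to_node)


def phase_trace(before: Counters, after: Counters) -> Counters:
    """Estimate a phase's references as counters after minus before."""
    trace: Counters = {}
    for page, nodes in after.items():
        start = before.get(page, {})
        delta = {n: c - start.get(n, 0) for n, c in nodes.items()}
        delta = {n: c for n, c in delta.items() if c > 0}
        if delta:
            trace[page] = delta
    return trace


def record(trace: Counters, home: Dict[int, int], thr: float = 2.0) -> List[Move]:
    """Apply an assumed competitive test to every page touched in the phase."""
    moves: List[Move] = []
    for page in sorted(trace):
        refs, src = trace[page], home[page]
        remote = {n: c for n, c in refs.items() if n != src}
        if not remote:
            continue
        dst = min(remote, key=lambda n: (-remote[n], n))
        if remote[dst] / max(refs.get(src, 0), 1) > thr:
            moves.append((page, src, dst))
    return moves


def replay(moves: List[Move], home: Dict[int, int]) -> None:
    """Redo the recorded migrations before the phase starts."""
    for page, _src, dst in moves:
        home[page] = dst


def undo(moves: List[Move], home: Dict[int, int]) -> None:
    """Send migrated pages back to their original homes."""
    for page, src, _dst in moves:
        home[page] = src
```

The list of moves is computed once, from the counters recorded in the first iteration, and then reused unchanged in every later iteration, which is what keeps the per-iteration bookkeeping small.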

Figure 2 gives an example of how the record/replay mechanism is used in
the NAS BT benchmark. BT has a phase change in the routine z_solve, due
to the alignment of data in memory, which is done along the x and y
dimensions. The page reference counters are recorded before and after the
first execution of z_solve and the recorded values are used to identify page
migrations which are replayed before every execution of z_solve in subsequent
iterations. The routine upmlib_undo() is used to undo the page migrations
performed by upmlib_replay(), in order to recover the initial page placement
scheme that is tuned for x_solve, y_solve and add.

      call upmlib_init()
      call upmlib_memrefcnt(u, size)
      call upmlib_memrefcnt(rhs, size)
      call upmlib_memrefcnt(forcing, size)

      do step = 1, niter
         call compute_rhs
         call x_solve
         call y_solve
         if (step .eq. 1) then
            call upmlib_record()
         else
            call upmlib_replay()
         endif
         call z_solve
         if (step .eq. 1) then
            call upmlib_record()
         else
            call upmlib_undo()
         endif
         call add
      enddo

Fig. 2. Using the UPMlib record/replay mechanism in NAS BT

With the record/replay mechanism, page migrations necessarily reside on the
critical path of the program. The mechanism is sensitive to the granularity of
phases and is expected to work in cases in which the duration of each phase is
long enough to amortize the cost of page migrations. In order to limit this
cost, the record/replay mechanism can optionally move only the n most critical
pages in each iteration, where n is a tunable parameter set experimentally to
balance the overhead of page migrations against the gains from reducing the
rate of remote memory accesses. The n most critical pages are determined as
follows: the pages are sorted in descending order according to the ratio
$racc_{max}/lacc$, where $lacc$ is the number of local accesses from the home
node of the page and $racc_{max}$ is the maximum number of remote accesses
from any of the other nodes. The pages that satisfy the inequality
$racc_{max}/lacc > thr$, where $thr$ is a predefined threshold, are considered
eligible for migration. Let m be the number of these pages. If m > n, the
mechanism migrates the n pages with the highest ratios $racc_{max}/lacc$.
Otherwise, the mechanism migrates the m eligible pages.
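The selection rule can be written compactly. The Python sketch below assumes that the ranking ratio is raccmax divided by lacc, and it adds two details that the text does not specify: pages with no local accesses are treated as having one, and ties are broken by page id.

```python
from typing import Dict, List, Tuple


def most_critical_pages(pages: Dict[int, Tuple[int, int]],
                        thr: float, n: int) -> List[int]:
    """Pick at most n pages to migrate.

    pages maps page id -> (lacc, raccmax): local accesses from the home
    node and the maximum remote accesses from any other node.
    """
    # max(lacc, 1) guards against division by zero (an assumption).
    ratios = {page: raccmax / max(lacc, 1)
              for page, (lacc, raccmax) in pages.items()}
    # The m eligible pages are those whose ratio exceeds the threshold.
    eligible = [page for page in ratios if ratios[page] > thr]
    # Sort by descending ratio; ties are broken by page id.
    eligible.sort(key=lambda page: (-ratios[page], page))
    # If m > n this keeps the n highest ratios, otherwise all m pages.
    return eligible[:n]
```

Slicing with [:n] covers both branches of the rule, since a slice longer than the list simply returns the whole list.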

Similarly to data distribution and redistribution, the mechanisms described 
in Sections 2.1 and 2.2 can be combined effectively to obtain the best of the two 
functionalities in OpenMP programs. For example, the iterative page migration 
mechanism can be used in the first few iterations of a program to establish 
quickly a good initial page placement. The record/replay mechanism can be 
activated afterwards to optimize page placement across phase changes. 

3 Experimental Results 

We provide a set of experimental results that substantiate our argument that
dynamic page migration is an effective substitute for page distribution and
redistribution in OpenMP. Our results are constrained by the fact that we were
able to experiment only with iterative parallel codes, namely the OpenMP
implementations of the NAS benchmarks as provided by their developers [5].
Therefore, we follow a synthetic experimental approach for the cases in which
the characteristics of the benchmarks do not meet the analysis requirements.
All the experiments were conducted on 16 idle processors of a 64-processor SGI
Origin2000 with MIPS R10000 processors running at 250 MHz and 8 Gbytes of
memory. The system ran version 6.5.5 of the SGI IRIX OS.

Fig. 3. Performance of UPMlib with different page placement schemes



3.1 Data Distribution 

We conducted the following experiment to assess the effectiveness of the
iterative mechanism of UPMlib. We used the optimized OpenMP implementations of
five NAS benchmarks (BT, SP, CG, MG, FT), which were customized to exploit the
first-touch page placement scheme of the SGI Origin2000 [8]. Considering
first-touch as the page placement scheme that achieves the best data
distribution for these codes, we ran the codes using three alternative page
placement schemes, namely round-robin page placement, random page placement
and worst-case page placement. Round-robin page placement could be optionally
requested via an environment variable. Random page placement was hand-coded in
the benchmarks, using the standard UNIX page protection mechanism to capture
page faults and relocate pages, thus bypassing the default operating system
strategy. Worst-case page placement was forced by a sequential execution of
the cold-start iteration of each program, during which all data pages were
placed on a single node of the system.

Figure 3 shows the results from executing the OpenMP implementations of
the NAS BT and CG benchmarks with four page placement schemes. The observed
trends are similar for all the NAS benchmarks used in the experiments.
We omit the charts for the rest of the benchmarks due to space considerations.
Each bar is an average of three independent experiments. The variance in all
cases was negligible. The black bars illustrate the execution time with the
different page placement schemes, labeled as ft-IRIX, rr-IRIX, rand-IRIX and
wc-IRIX, for first-touch, round-robin, random, and worst-case page placement
respectively. The light gray bars illustrate the execution time with the same
page placement scheme and the IRIX page migration engine enabled during the
execution of the benchmarks (same labels with suffix -IRIXmig). The dark gray
bars illustrate the execution time with the UPMlib iterative page migration
mechanism enabled in the benchmarks (same labels with suffix -upmlib). The
horizontal lines show the baseline performance with the native first-touch
page placement scheme of IRIX.

Fig. 4. Performance of the sampling page migration mechanism of UPMlib

The results show that page placement schemes other than first-touch incur
significant slowdowns compared to first-touch, ranging from 24% to 210%.
The same phenomenon is observed even when page migration is enabled in the
IRIX kernel, although page migration generally improves performance. On the
other hand, when the suboptimal page placement schemes are combined with
the iterative page migration mechanism of UPMlib, they closely approximate
the performance of first-touch. When the page migration engine of UPMlib is
injected in the benchmarks, the average performance difference between
first-touch and the other page placement schemes is as low as 5%. In the
second half of the iterations of the programs, the measured performance
difference was less than 1%. In practice, this means that the iterative page
migration mechanism rapidly approaches the best initial page placement in each
program. It also means that the performance of OpenMP programs can be immune
to the page placement strategy of the operating system, as long as a page
migration engine can relocate poorly placed pages early. No programmer
intervention is required to achieve this level of optimization.

In order to assess the effectiveness of the sampling-based page migration
engine of UPMlib, we conducted the following experiment. We activated the
sampling mechanism in the NAS benchmarks and compared the performance
obtained with the sampling mechanism against the performance obtained with
the iterative mechanism. The iterative mechanism is tuned to exploit the
structure of the NAS benchmarks and can therefore serve as a meaningful
performance bound for the sampling mechanism.

Figure 4 illustrates the execution times obtained with the sampling
mechanism, compared to the execution times obtained with the iterative
mechanism in the NAS BT and CG benchmarks. BT is a relatively long-running
code with an execution time in the order of one and a half minutes. On the
other hand, CG's execution time is only a few seconds. For BT, we used a
sampling frequency of 100 pages per second. For CG, we used a sampling
frequency of 100 pages per 300 milliseconds. In the case of BT, the sampling
mechanism is able to obtain performance essentially identical to that of the
iterative mechanism. A similar trend was observed for SP, the execution time
of which is similar to that of BT. However, for the short-running CG
benchmark, despite the use of a higher sampling frequency, the sampling
mechanism consistently performs significantly worse than the iterative
mechanism. The same happens with MG and FT. The results mainly demonstrate the
sensitivity of the sampling mechanism to the execution characteristics of the
applications. The sampling mechanism is unlikely to benefit short-running
codes.

3.2 Data Redistribution 

We evaluated the ability of our user-level page migration engine to emulate
data redistribution by activating the record/replay mechanism of UPMlib in the
NAS BT and SP benchmarks. Both benchmarks have a phase change in the execution
of the z_solve function, as shown in Fig. 2. In these experiments, we restrict
the page migration engine to move the n most critical pages across phases. The
parameter n was set equal to 20.

Figure 5 illustrates the performance of the record/replay mechanism with 
first-touch page placement (labeled ft-recrep in the charts), as well as the per- 
formance of a hybrid scheme that uses the iterative page migration mechanism 
in the first few iterations of the programs and the record/replay mechanism in 
the rest of the iterations, as described in Section 2.2 (labeled ft-hybrid in the 
charts). The striped part of the bars labeled ft-recrep and ft-hybrid shows 
the non-overlapped overhead of the record/replay mechanism. For illustrative
purposes, the figure also shows the execution time of BT and SP with
first-touch and the IRIX page migration engine, as well as the execution time
with the iterative page migration mechanism of UPMlib.

Fig. 5. Performance of the record/replay mechanism for NAS BT and SP
(panels: NAS BT, Class A, 16 processors; NAS SP, Class A, 16 processors)

The results indicate that applying page migration for fine-grain tuning across
phase changes may be unprofitable due to the excessive overhead of page
movements in the operating system. In the cases of BT and SP the overhead of
the record/replay mechanism appears to outweigh the gains from reducing the
rate of remote memory accesses. A more detailed analysis of the codes reveals
that the parallel execution of z_solve in BT and SP on 16 processors takes
approximately 130 to 180 ms. The recording mechanism of UPMlib identifies
between 160 and 250 pages to migrate before the execution of z_solve. The cost
of a page migration in the system we experimented with was measured to range
between 1 and 1.3 ms, depending on the distance between the nodes that
competed for the page. The total cost of moving all the pages identified by
the recording mechanism as candidates for migration would significantly exceed
the duration of the phase, making the record/replay mechanism useless.
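The figures quoted above can be combined into a back-of-the-envelope bound. The symbols below are introduced only for this illustration, and the estimate assumes that migrations are serialized and that each costs between 1 and 1.3 ms:

```latex
\begin{align*}
T_{\text{migr}}^{\min} &= 160 \times 1\,\text{ms} = 160\,\text{ms}, \\
T_{\text{migr}}^{\max} &= 250 \times 1.3\,\text{ms} = 325\,\text{ms}, \\
T_{\text{phase}} &\approx 130 \text{ to } 180\,\text{ms}.
\end{align*}
```

Under these assumptions the replay step alone costs about as much as the phase in the best case and roughly twice as much in the worst case. If the undo step moves the same pages back, as Fig. 2 suggests, the per-iteration cost doubles again.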

Migrating the 20 most critical pages reduced the execution time of useful
computation by about 10% in the case of BT. However, the overhead of page
migrations outweighed the gains. The performance of the record/replay
mechanism in the SP benchmark was disappointing. The limited improvements
are partially attributed to the architectural characteristics of the
Origin2000, most notably the very low remote-to-local memory access latency
ratio of the system, which is about 2:1 at the scale at which we experimented.
The reduction of remote memory accesses would have a more significant
performance impact on systems with higher remote-to-local memory access
latency ratios. We have experimented with larger values for n and observed
significant performance degradation, attributed again to the page migration
overhead. The hybrid scheme appears to outperform the record/replay scheme
(marginally for BT and significantly for SP), but it is still penalized by the
overhead of page migrations.

In order to quantify the extent to which the record/replay mechanism is
applicable, we executed the following synthetic experiment. We modified the
code of NAS BT to quadruple the amount of work performed in each iteration of
the parallel computation. We did not change the problem size, in order to
preserve the locality characteristics of the program. Instead, we enclosed
each of the functions that comprise the main body of the computation
(x_solve, y_solve, z_solve, add) in a loop. With this modification, we
lengthened the duration of the parallel execution of z_solve to approximately
500 ms. Figure 6 shows the results from these experiments. It is evident that
a better amortization of the overhead of page migrations helps the
record/replay mechanism. In this experiment, the cost of the record/replay
mechanism remains the same as in the previous experiments; however, the
reduction of remote memory accesses achieved by the mechanism is exploited
over a longer period of time. This yields a performance improvement of 5% over
the iterative page migration mechanism.

Fig. 6. Performance of the record/replay mechanism in the synthetic experiment
with NAS BT

4 Conclusion 

This paper presented and evaluated mechanisms for transparent data
distribution in OpenMP programs. The mechanisms leverage dynamic page
migration as a data distribution technique that is transparent to the
programmer. We have shown that effective initial page placement can be
established with a smart user-level page migration engine that exploits the
iterative structure of parallel codes. Our results clearly demonstrate that
the need for introducing data distribution directives in OpenMP is
questionable and may not warrant the implementation and standardization costs.
On the other hand, we have shown that although page migration may be effective
for coarse-grain optimization of data locality, it suffers from excessive
overhead when applied for tuning page placement at fine-grain time scales. It
is therefore critical to estimate the cost/performance tradeoffs of page
migration in order to investigate the extent to which aggressive page
migration strategies can work effectively in place of data distribution and
redistribution on distributed shared-memory multiprocessors. Since the same
investigation would also be necessary in data-parallel environments, we do not
consider it a major restriction of our environment.

Abstract. Performance analysis is an important step in tuning
performance-critical applications. It is a cyclic process of measuring and
analyzing performance data which is driven by the programmer's hypotheses
on potential performance problems. Currently this process is controlled
manually by the programmer. We believe that the implicit knowledge applied
in this cyclic process should be formalized in order to provide automatic
performance analysis for a wider class of programming paradigms and target
architectures. This article describes the performance property specification
language (ASL) developed in the APART Esprit IV working group, which allows
specifying performance-related data by an object-oriented model and
performance properties by functions and constraints defined over
performance-related data. Performance problems and bottlenecks can then be
identified based on user- or tool-defined thresholds. In order to demonstrate
the usefulness of ASL we apply it to OpenMP by successfully formalizing
several OpenMP performance properties.



Keywords: performance analysis, knowledge representation, OpenMP, per- 
formance problems, language design 

1 Introduction 

Performance-oriented program development can be a daunting task. In order
to achieve high or at least respectable performance on today's multiprocessor
systems, careful attention to a plethora of system and programming paradigm
details is required. Commonly programmers go through many cycles of
experimentation involving gathering performance data, performance data
analysis (a priori and post mortem), detection of performance problems, and
code refinements in slow progression. Clearly, the programmer must be
intimately familiar with many aspects related to this experimentation process.
Although there exists a large number of tools assisting the programmer in
performance experimentation, it is still the programmer's responsibility to
make most strategic decisions.

* The ESPRIT IV Working Group on Automatic Performance Analysis: Resources

In this article we describe a novel approach to formalize performance bottle- 
necks and the data required in detecting those bottlenecks with the aim to sup- 
port automatic performance analysis for a wider class of programming paradigms 
and architectures. This research is done as part of APART Esprit IV Working 
Group on Automatic Performance Analysis: Resources and Tools [Apart 99]. In 
the remainder of this article we use the following terminology: 



Performance-related Data: Performance-related data defines information that 
can be used to describe performance properties of a program. There are two 
classes of performance related data. First, static data specifies information 
that can be determined without executing a program on a target machine. 
Second, dynamic performance-related data describes the dynamic behavior 
of a program during execution on a target machine. 

Performance Property: A performance property (e.g. load imbalance, com- 
munication, cache misses, redundant computations, etc.) characterizes a spe- 
cific performance behavior of a program and can be checked by a set of con- 
ditions. Conditions are associated with a confidence value (between 0 and 
1) indicating the degree of confidence about the existence of a performance 
property. In addition, for every performance property a severity figure is 
provided that specifies the importance of the property. 

Performance Problem: A performance property is a performance problem, 
iff its severity is greater than a user- or tool-defined threshold. 

Performance Bottleneck: A program can have one or several performance 
bottlenecks which are characterized by having the highest severity figure. If 
these bottlenecks are not a performance problem, then the program’s per- 
formance is acceptable and does not need any further tuning. 
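The terminology above maps naturally onto a few lines of code. The following Python sketch is not ASL; it only illustrates how a property, a problem and a bottleneck relate to each other. The class name, field names and helper functions are chosen for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PerformanceProperty:
    name: str
    confidence: float  # degree of confidence in [0, 1] that the property holds
    severity: float    # importance of the property


def problems(props: List[PerformanceProperty],
             threshold: float) -> List[PerformanceProperty]:
    """A property is a problem iff its severity exceeds the threshold."""
    return [p for p in props if p.severity > threshold]


def bottlenecks(props: List[PerformanceProperty]) -> List[PerformanceProperty]:
    """Bottlenecks are the properties with the highest severity figure."""
    if not props:
        return []
    top = max(p.severity for p in props)
    return [p for p in props if p.severity == top]


def needs_tuning(props: List[PerformanceProperty], threshold: float) -> bool:
    """Tuning is needed only if a bottleneck is also a problem."""
    return any(b.severity > threshold for b in bottlenecks(props))
```

Because the bottleneck has the highest severity by definition, checking it against the threshold is enough to decide whether any property at all is a performance problem.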



This paper introduces the APART Specification Language (ASL) which al- 
lows the description of performance-related data through the provision of an 
object-oriented specification model and which supports the definition of perfor- 
mance properties in a novel formal notation. Our object-oriented specification 
model is used to declare - without the need to compute - performance infor- 
mation. It is similar to Java, uses only single inheritance and does not require 
methods. A novel syntax has been introduced to specify performance properties. 

The organization of this article is as follows. Section 2 presents related
work. Section 3 presents ASL constructs for specifying performance-related
data and, as examples, classes specifying performance-related data for OpenMP
programs. The syntax for the specification of performance properties is
described in Section 4. Examples of OpenMP property specifications are
presented in Section 5. Conclusions and future work are discussed in
Section 6.

2 Related work

The use of specification languages in the context of automatic performance
analysis tools is a new approach. Paradyn [MCCHI 95] performs an automatic
online analysis and is based on dynamic monitoring. While the underlying
metrics can be defined via the Metric Description Language (MDL), the set of
searched bottlenecks is fixed. It includes CPUbound, ExcessiveSyncWaitingTime,
ExcessiveIOBlockingTime, and TooManySmallIOOps.

A rule-based specification of performance bottlenecks and of the analysis
process was developed for the performance analysis tool OPAL [GKO 95] in the
SVM-Fortran project. The rule base consists of a set of parameterized
hypotheses with proof rules and refinement rules. The proof rules determine
whether a hypothesis is valid based on the measured performance data. The
refinement rules specify which new hypotheses are generated from a proven
hypothesis [GeKr 97].

Another approach is to define a performance bottleneck as an event pattern in 
program traces. EDL [Bates 83] allows the definition of compound events based 
on extended regular expressions. EARL [WoMo 99] describes event patterns in a 
more procedural fashion as scripts in a high-level event trace analysis language 
which is implemented as an extension of common scripting languages like Tcl, 
Perl or Python. 

The languages presented here served as a starting point for the definition
of the ASL but are too limited. MDL does not allow access to static program
information or the integration of information from multiple performance tools.
It is specially designed for Paradyn. The design of OPAL focuses more on the
formalization of the analysis process, and EDL and EARL are limited to pattern
matching in performance traces.

Some other performance analysis and optimization tools apply automatic 
techniques without being based on a special bottleneck specification, such as 
KAPPA-PI [EsMaLu 98], FINESSE [MuRiGu 00], and the online tuning system
Autopilot [RVSR 98]. 

3 Performance-related Data Specification 

This section presents the performance-related data specification for OpenMP
[DaMe 99]. Performance-related data are specified in ASL by a set of classes
following an object-oriented style with single inheritance. In this article we present
the classes as UML diagrams. More details on the specification, including a
number of base classes for all models, can be found in [Apart WP2 99].

Several classes were defined that model static information for OpenMP pro- 
grams. Class SmRegion is a subclass of the standard class Region and contains an 
attribute with data dependence information about the modeled region. SmRe- 
gion is then further refined by two subclasses ParallelRegion and SequentialRe- 
gion which, respectively, describe parallel and sequential regions. Parallel regions 
include a boolean variable no_wait_exit which denotes whether or not the region
is terminated by an implicit exit barrier operation. A specific execution of a 
region corresponds to a region instance. 
Fig. 1. OpenMP classes for dynamic information



Figure 1 shows the OpenMP class library for dynamic information. Class
SmRegionSummary extends the standard class RegionSummary and comprises
three attributes: nr_executions specifies the number of times a region has been
executed by the master thread, sums describes summary information across all
region instances, and instance_sums relates to summary information for a specific
region instance. The attributes of class SmSums include:

- duration: time needed to execute region by master thread

- non_parallelized_code: time needed to execute non-parallelized code

- seq_fraction:

- nr_remote_accesses: number of accesses to remote memory by load and store
  operations in ccNUMA machines

- scheduling: time needed for scheduling operations (e.g. scheduling of threads)

- additional_calc: time needed for additional computations in parallelized code
  (e.g. to enforce a specific distribution of loop iterations) or for additional
  computations (e.g. where it is cheaper for all threads to compute a value
  rather than communicate it, possibly with synchronization costs)

- cross_thread_dep_ctrl: synchronization time except for entry and exit barriers
  and waiting in locks

- cross_thread_dep_wait: synchronization waiting time except waiting in entry
  or exit barrier

- region_wait: waiting time in entry or exit barrier

- region_ctrl: time needed to execute region control instructions (e.g. controlling
  barriers)

- nr_cache_misses: number of cache misses

- thread_sums: summary data for every thread executing the region

- accessed_variables: set of remote access counts for individual variables referenced
  in that region

Note that attributes duration and region_ctrl are given with respect to the
master thread, whereas all other attributes are average values across all threads
that execute a region. Summary data (described by class SmThreadSums) for
every thread executing a region is specified by thread_sums in SmSums. The
attributes of SmThreadSums are a subset of class SmSums attributes and refer
to summary information for specific threads identified by a unique thread number
(thread_no).

In addition to the number of remote accesses in a region, the number of
remote accesses is collected for individual variables that are referenced in that
region. This information is modeled by the class VariableRemoteAccesses with
the attributes var_name, nr_remote_accesses, and size, which denotes the total
size of the variable in bytes. This information can be measured if address-range-specific
monitoring is supported, e.g. [KaLeObWa 98]. The last attribute of this
class is page_sums, which is a set of page-level remote access counters. For example,
the remote access counters on SGI Origin 2000 provide such information.
With the help of additional mapping information, i.e. mapping variables to addresses,
this information can be related back to program variables. Each object
of class PageRemoteAccesses determines the page_no and the number of remote
accesses.

The third attribute of SmRegionSummary is given by instance_sums which
is described by a class SmInstanceSums. This class specifies summary infor- 
mation for a specific region instance. SmInstanceSums contains all attributes 
of SmSums and the number of threads executing the region instance. Finally, 
class SmThreadInstanceSums describes summary information for a given region 
instance with respect to individual threads. 

4 Performance Property Specification 

A performance property (e.g. load imbalance, remote accesses, cache misses, re- 
dundant computations, etc.) characterizes a specific performance behavior of a 
program. The ASL property specification syntax defines the name of the property,
its context via a list of parameters, and the condition, confidence, and
severity expressions. The property specification is based on a set of parameters.
These parameters specify the property's context and parameterize the expressions.
The context specifies the environment in which the property is evaluated,
e.g. the program region and the test run. Details can be found in [Apart WP2 99]. 

The condition specification consists of a list of conditions. A condition is 
a predicate that can be prefixed by a condition identifier. The identifiers have 
to be unique with respect to the property since the confidence and severity 
specifications may refer to the conditions by using the condition identifiers. 

The confidence specification is an expression that computes the maximum of 
a list of confidence values. Each confidence value is computed as an arithmetic 
expression in an interval between zero and one. The expression may be guarded 
by a condition identifier introduced in the condition specification. The condition 
identifier represents the value of the condition. 

The severity specification has the same structure as the confidence specifi- 
cation. It computes the maximum of the individual severity expressions of the 
conditions. The severity specification will typically be based on a parameter 
specifying the ranking basis. If, for example, a representative test run of the 
application has been monitored, the time spent in remote accesses may be com- 
pared to the total execution time. If, instead, a short test run is the basis for 
performance evaluation since the application has a cyclic behavior, the remote 
access overhead may be compared to the execution time of the shortened loop. 
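The structure described in this section can be mimicked in a few lines of Python. This is only an illustrative sketch, not part of ASL: the class `Property`, the method `evaluate`, and the dictionary-based context are hypothetical names, and the rule that a property holds when at least one condition is true is an assumption that the text above does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# (condition id or None, expression over the context); names are illustrative only.
Guarded = Tuple[Optional[str], Callable[[dict], float]]


@dataclass
class Property:
    name: str
    conditions: Dict[str, Callable[[dict], bool]]  # condition id -> predicate
    confidence: List[Guarded]
    severity: List[Guarded]

    def evaluate(self, ctx: dict):
        """Return (holds, confidence, severity) for one context."""
        truth = {cid: pred(ctx) for cid, pred in self.conditions.items()}
        # Assumption: the property holds if any listed condition is true.
        if not any(truth.values()):
            return False, 0.0, 0.0

        def max_guarded(exprs: List[Guarded]) -> float:
            # An expression counts if it is unguarded or its guard condition holds.
            values = [f(ctx) for guard, f in exprs if guard is None or truth[guard]]
            return max(values, default=0.0)

        return True, max_guarded(self.confidence), max_guarded(self.severity)


remote = Property(
    name="remote_accesses",
    conditions={"c1": lambda c: c["remote_time"] > 0},
    confidence=[(None, lambda c: 1.0)],
    severity=[("c1", lambda c: c["remote_time"] / c["basis_duration"])],
)
result = remote.evaluate({"remote_time": 2.0, "basis_duration": 10.0})
print(result)
```

Here the ranking basis is passed in as `basis_duration`, so the same property can be ranked against a full run or against a shortened loop simply by changing the context.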

5 OpenMP Performance Properties 

This section demonstrates the ASL constructs for specifying performance prop- 
erties in the context of the shared memory, OpenMP paradigm. Some global 
definitions are presented first. These are then used in the definitions of a num- 
ber of OpenMP properties. 



Global definitions 

In most property specifications it is necessary to access the summary data of a 
given region for a given experiment. Therefore, we define the summary function 
that returns the appropriate SmRegionSummary object. It is based on the set 
operation UNIQUE that selects arbitrarily one element from the set argument 
which has cardinality one due to the design of the data model. 

SmRegionSummary summary(Region r, Experiment e) =
    UNIQUE({s IN e.profile WITH s.region == r});

The sync function determines the overhead for synchronization in a given 
region. It computes the sum of the relevant attributes (which are deemed by the 
property specifier to be components of synchronisation cost) in the summary 
class. 

float sync(Region r, Experiment e) =
    summary(r,e).sums.region_wait +
    summary(r,e).sums.region_ctrl +
    summary(r,e).sums.cross_thread_dep_wait +
    summary(r,e).sums.cross_thread_dep_ctrl;
The duration function returns the execution time of a region. The execution
time is determined by the execution time of the master thread in the OpenMP 
model. 



float duration(Region r, Experiment e) = summary(r,e).sums.duration;



The remote_access_time function estimates the overhead for accessing remote
memory based on the measured number of accesses and the mean access time of 
the parallel machine. 



float remote_access_time(Region r, Experiment e) =
    summary(r,e).sums.nr_remote_accesses
    * e.system.remote_access_time;
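For readers who want to experiment with these global definitions, the following Python sketch mirrors them on a toy data model. The class names `Sums`, `RegionSummary` and `Experiment`, and the field `mean_remote_access_time` (standing in for `e.system.remote_access_time`), are simplifications chosen for this example and are not the ASL classes themselves. The helper `unique` is stricter than UNIQUE as described above: it raises an error unless the set has exactly one element.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sums:
    duration: float = 0.0
    region_wait: float = 0.0
    region_ctrl: float = 0.0
    cross_thread_dep_wait: float = 0.0
    cross_thread_dep_ctrl: float = 0.0
    nr_remote_accesses: int = 0


@dataclass
class RegionSummary:
    region: str
    sums: Sums


@dataclass
class Experiment:
    profile: List[RegionSummary]
    mean_remote_access_time: float  # stands in for e.system.remote_access_time


def unique(items):
    """Return the single element of a one-element collection, else raise."""
    items = list(items)
    if len(items) != 1:
        raise ValueError(f"expected exactly one element, got {len(items)}")
    return items[0]


def summary(r, e):
    return unique(s for s in e.profile if s.region == r)


def sync(r, e):
    s = summary(r, e).sums
    return (s.region_wait + s.region_ctrl
            + s.cross_thread_dep_wait + s.cross_thread_dep_ctrl)


def duration(r, e):
    return summary(r, e).sums.duration


def remote_access_time(r, e):
    return summary(r, e).sums.nr_remote_accesses * e.mean_remote_access_time


exp = Experiment(
    profile=[RegionSummary("loop1", Sums(duration=10.0, region_wait=1.0,
                                         region_ctrl=0.5,
                                         cross_thread_dep_wait=0.25,
                                         cross_thread_dep_ctrl=0.25,
                                         nr_remote_accesses=1000))],
    mean_remote_access_time=0.001,
)
print(sync("loop1", exp), duration("loop1", exp), remote_access_time("loop1", exp))
```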



Property specifications 



The costs property determines whether the total parallel overhead in the execu- 
tion is non-zero. 



Property costs(Region r, Experiment seq,
               Experiment par, Region rank_basis){
  LET
    float total_costs = duration(r,par) - (duration(r,seq) /
                                           par.nr_processors);
  IN
    CONDITION: total_costs>0;
    CONFIDENCE: 1;
    SEVERITY: total_costs / duration(rank_basis,par);
}



This property specifies that the speedup of the application is not simply the 
serial execution time divided by the number of processors, i.e. that the naive ideal 
linear speedup is not being achieved. It uses information from two experiments, 
a sequential run and a parallel run, to compute the costs of parallel execution. 
Those costs determine the severity of the property. 

Performance analysis tools initially only help in analyzing costs for properties
which are known to affect performance. After accounting for these properties,
any remaining costs must be due to new, as yet unencountered, effects. A region
has the identified_costs property if the sum of the expected potential costs is
greater than zero. The severity of this property is the fraction of those costs
relative to the execution time of rank_basis.
Property identified_costs(Region r, Experiment e, Region rank_basis){
  LET
    float costs = summary(r,e).sums.non_parallelized_code +
                  sync(r,e) +
                  remote_access_time(r,e) +
                  summary(r,e).sums.scheduling +
                  summary(r,e).sums.additional_calc;
  IN
    CONDITION: costs>0;
    CONFIDENCE: 1;
    SEVERITY: costs/duration(rank_basis,e);
}

The total cost of the parallel program is the sum of the identified and the
unidentified overhead. The unidentified_costs property determines whether an
unidentified overhead exists. Its severity is the fraction of this overhead in
relation to the execution time of rank_basis. If this fraction is high, further
tool-supported performance analysis might be required.

Property unidentified_costs(Region r, Experiment seq, Experiment e,
                            Region rank_basis){
  LET
    float totcosts = duration(r,e) - (duration(r,seq)/e.nr_processors);
    float costs = summary(r,e).sums.non_parallelized_code + sync(r,e) +
                  remote_access_time(r,e) + summary(r,e).sums.scheduling +
                  summary(r,e).sums.additional_calc;
  IN
    CONDITION: totcosts-costs>0;
    CONFIDENCE: 1;
    SEVERITY: (totcosts-costs)/duration(rank_basis,e);
}
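The relation between total, identified and unidentified costs is plain arithmetic, which the following Python sketch works through on invented numbers. The function names and all timing values are hypothetical; only the formulas follow the three property specifications above.

```python
def total_costs(par_duration, seq_duration, nr_processors):
    """Parallel time minus the ideal time under linear speedup."""
    return par_duration - seq_duration / nr_processors


def identified_costs(non_parallelized_code, sync, remote_access_time,
                     scheduling, additional_calc):
    """Sum of the overhead categories that the data model can attribute."""
    return (non_parallelized_code + sync + remote_access_time
            + scheduling + additional_calc)


def unidentified_costs(total, identified):
    """Overhead that none of the known categories explains."""
    return total - identified


# Invented example: 80 s sequential run, 16 s parallel run on 8 processors.
total = total_costs(par_duration=16.0, seq_duration=80.0, nr_processors=8)
known = identified_costs(non_parallelized_code=1.0, sync=2.0,
                         remote_access_time=1.0, scheduling=0.5,
                         additional_calc=0.5)
unknown = unidentified_costs(total, known)

# Severity relative to a ranking basis that lasts 16 s (the whole parallel run).
rank_basis_duration = 16.0
severity_unidentified = unknown / rank_basis_duration
print(total, known, unknown, severity_unidentified)
```

In this example one of the six seconds of overhead remains unexplained, which is the situation in which further tool-supported analysis would be considered.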

Non-parallelized code is a very severe problem for application scaling. In the
context of analyzing a given program run, its severity is determined in the usual 
way, relative to the duration of the rank basis. If the focus of the analysis is more 
on application scaling the severity could be redefined to stress the importance 
of this property. 

Property non_parallelized_code(Region r, Experiment e,
                               Region rank_basis){
  LET
    float non_par_code = summary(r,e).sums.non_parallelized_code;
  IN
    CONDITION: non_par_code>0;
    CONFIDENCE: 1;
    SEVERITY: non_par_code/duration(rank_basis,e);
}

A region has the following synchronization property if any synchronization
overhead occurs during its execution. One of the obvious reasons for high
synchronization cost is load imbalance, which is an example of a more specific
property: an application suffering the load_imbalance property is, by implication,
suffering synchronization.

Property synchronization(Region r, Experiment e, Region rank_basis){
  CONDITION: sync(r,e)>0;
  CONFIDENCE: 1;
  SEVERITY: sync(r,e)/duration(rank_basis,e);
}

Property load_imbalance(Region r, Experiment e, Region rank_basis){
  CONDITION: summary(r,e).sums.region_wait>0;
  CONFIDENCE: 1;
  SEVERITY: summary(r,e).sums.region_wait/duration(r,e);
}

When work is unevenly distributed to threads in a region, this manifests itself
in region_wait time (time spent by threads waiting on the region exit barrier).
If the region_wait time cannot be measured, the property can also be derived as
the execution time of the thread with the longest duration minus the average
thread duration.
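The fallback estimate mentioned above (longest thread duration minus average thread duration) is easy to state in code. The function name `load_imbalance_estimate` and the sample durations below are made up for illustration.

```python
def load_imbalance_estimate(thread_durations):
    """Longest per-thread execution time minus the average over all threads."""
    if not thread_durations:
        raise ValueError("need at least one thread duration")
    longest = max(thread_durations)
    average = sum(thread_durations) / len(thread_durations)
    return longest - average


# Three threads finish after 4 s, one straggler needs 8 s.
estimate = load_imbalance_estimate([4.0, 4.0, 4.0, 8.0])
print(estimate)
```

A perfectly balanced region yields zero, and the estimate grows as a single thread falls further behind the others.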

The synchronization property defined above is assigned to regions with an
aggregate non-zero synchronization cost during the entire execution of the program.
If the dynamic behaviour of an application changes over the execution
time (load imbalance, for example, might occur only in specific phases of the
simulation), the whole synchronization overhead might result from specific
instances of the region. The irregular_sync_across_instances property identifies this
case. The severity is equal to the severity of the synchronization property since
the irregular_sync_across_instances property is only a more detailed explanation.

Property irregular_sync_across_instances
         (Region r, Experiment e, Region rank_basis){
  LET
    float inst_sync(SmInstanceSums sum) = sum.region_wait +
        sum.region_ctrl + sum.cross_thread_dep_wait +
        sum.cross_thread_dep_ctrl;
  IN
    CONDITION: stdev(inst_sync(inst_sum)
                     WHERE inst_sum IN summary(r,e).instance_sums)
               > irreg_behaviour_threshold * sync(r,e)/summary(r,e).nr_executions;
    CONFIDENCE: 1;
    SEVERITY: sync(r,e)/duration(rank_basis,e);
}
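The condition of this property compares the spread of the per-instance synchronization times with a fraction of their mean. A minimal Python sketch follows, under two assumptions that the specification leaves open: the population standard deviation is used for `stdev`, and `sync(r,e)/nr_executions` is taken to be the mean of the per-instance values. The function name and the threshold value are illustrative.

```python
from statistics import pstdev


def irregular_sync_across_instances(instance_sync, threshold):
    """True if per-instance sync times vary by more than threshold * mean.

    instance_sync: synchronization time of each region instance.
    threshold: plays the role of irreg_behaviour_threshold.
    """
    if not instance_sync:
        return False
    mean_sync = sum(instance_sync) / len(instance_sync)
    # Assumption: population standard deviation stands in for ASL's stdev.
    return pstdev(instance_sync) > threshold * mean_sync


regular = irregular_sync_across_instances([1.0, 1.0, 1.0, 1.0], threshold=0.5)
bursty = irregular_sync_across_instances([0.0, 0.0, 0.0, 4.0], threshold=0.5)
print(regular, bursty)
```

Both inputs have the same mean synchronization time per instance, so the plain synchronization property would rate them equally; only the second one is flagged as irregular.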

An important property for code executing on ccNUMA machines, remote ac- 
cesses, arises from access to data in memory located on nodes other than that of 
the requesting processor. Remote memory access involves communication among 
parallel threads. Since, usually, only the number of accesses can be measured, 
the severity is estimated based on the mean access time. 
Property remote_accesses(Region r, Experiment e, Region rank_basis){
  CONDITION: summary(r,e).sums.nr_remote_accesses>0;
  CONFIDENCE: 1;
  SEVERITY: remote_access_time(r,e)/duration(rank_basis,e);
}



The previous property, remote_accesses, identifies regions with remote accesses.
The next property, remote_access_to_variable, is more specific than this since its
context also includes a specific variable. The property indicates whether accesses 
to a variable in this region result in remote accesses. It is based on address-range- 
specific remote access counters, such as those provided by the SGI Origin 2000 
on a page basis. The severity of this property is based on the time spent in 
remote accesses to this variable. Since this property is very useful in explaining 
a severe remote access overhead for the region, it might be ranked with respect 
to this region, rather than with respect to the whole program, during a more 
detailed analysis. 



Property remote_access_to_variable
         (Region r, Experiment e, String var, Region rank_basis){
  LET
    VariableRemoteAccesses var_sum =
        UNIQUE({info IN summary(r,e).sums.accessed_variables
                WITH info.var_name==var});
  IN
    CONDITION: var_sum.nr_remote_accesses>0;
    CONFIDENCE: 1;
    SEVERITY: var_sum.nr_remote_accesses * e.system.remote_access_time
              /duration(rank_basis,e);
}



6 Conclusions and Future Work 

In this article we describe a novel approach to the formalization of performance 
problems and the data required to detect them with the future aim of supporting 
automatic performance analysis for a large variety of programming paradigms 
and architectures. We present the APART Specification Language (ASL) devel- 
oped as part of the APART Esprit IV Working Group on Automatic Perfor- 
mance Analysis: Resources and Tools. This language allows the description of 
performance-related data through the provision of an object-oriented specifica- 
tion model and supports definition of performance properties in a novel formal 
notation. 

We applied the ASL to OpenMP by successfully formalizing several OpenMP 
performance properties. ASL has also been used to formalize a large variety of 
MPI and HPF performance properties which is described in [Apart WP2 99]. 

Two extensions to the current language design will be investigated in the 
future: 
1. The language will be extended by templates which facilitate specification of
similar performance properties. In the example specification in this paper 
some of the properties result directly from the summary information, e.g. 
synchronization is directly related to the measured time spent in synchro- 
nization. The specifications of these properties are indeed very similar and 
need not be described individually. 

2. Meta-properties may be useful as well. For example, synchronization can be 
proven based on summary information, i.e. synchronization exists if the sum 
of the synchronization time in a region over all processes is greater than 
zero. A more specific property is to check whether individual instances of
the region or classes of instances are responsible for the synchronization due 
to some dynamic changes in the load distribution. Similar, more specific 
properties can be deduced for other properties as well. As a consequence, 
meta-properties can be useful to evaluate other properties based on region 
instances instead of region summaries. 

ASL should be the basis for a common interface for a variety of performance
tools that provide performance-related data. Based on this interface we plan to 
develop a system that provides automatic performance analysis for a variety of 
programming paradigms and target architectures. 
Abstract. The shared-memory programming model is a very effective
way to achieve parallelism on shared memory parallel computers. As
hardware and software technologies have progressed, the performance
of parallel programs with compiler directives has improved substantially.
The introduction of OpenMP directives, the industry standard for
shared-memory programming, has minimized the issue of portability. In
this study, we have extended CAPTools, a computer-aided parallelization
toolkit, to automatically generate OpenMP-based parallel programs with
nominal user assistance. We outline techniques used in the implementation
of the tool and discuss the application of this tool on the NAS Parallel
Benchmarks and several computational fluid dynamics codes. This work
demonstrates the great potential of using the tool to quickly port parallel
programs and also achieve good performance that exceeds that of some
commercial tools.



1 Introduction 

Porting applications to high performance parallel computers is always a challenging
task. It is time consuming and costly. With rapid progress in hardware architectures
and increasing complexity of real applications in recent years, the problem has become
even more severe. Today, scalability and high performance mostly involve hand-written
parallel programs using message-passing libraries (e.g. MPI). However, this
process is very difficult and often error-prone. The recent reemergence of
shared-memory parallel (SMP) architectures, such as the cache coherent Non-Uniform
Memory Access (ccNUMA) architecture used in the SGI Origin2000, shows good prospects
for scaling beyond hundreds of processors. Programming on an SMP is simplified by
working in a globally accessible address space. The user can supply compiler directives
to parallelize the code without explicit data partitioning. Computation is distributed
inside a loop based on the index range regardless of data location, and
scalability is achieved by taking advantage of hardware cache coherence. The recent
emergence of OpenMP [13] as an industry standard offers a portable solution for
implementing directive-based parallel programs for SMPs. OpenMP overcomes the
portability issues encountered by machine-specific directives without sacrificing much
of the performance and has gained popularity quickly.

M. Valero et al. (Eds.): ISHPC 2000, LNCS 1940, pp. 440-456, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 
Although programming with directives is relatively easy (when compared to writing
message passing codes), inserted directives may not necessarily enhance performance.
In the worst cases, they can create erroneous results when used incorrectly.
While vendors have provided tools to perform error-checking [10], automation in
directive insertion is very limited and often fails on large programs, primarily due to
the lack of a thorough enough data dependence analysis. To overcome this deficiency,
we have developed a toolkit, CAPO, to automatically insert OpenMP directives in
Fortran programs. CAPO is aimed at taking advantage of the detailed interprocedural data
dependence analysis provided by Computer-Aided Parallelization Tools (CAPTools)
[4], developed at the University of Greenwich, to reduce potential errors made by
users and, with nominal help from the user, achieve performance close to that obtained
when directives are inserted by hand. Our approach differs from other tools and
compilers in two respects: 1) emphasizing the quality of dependence analysis and
relaxing much of the time constraint on the analysis; 2) performing directive insertion
while preserving the original code structure for maintainability. Translation of OpenMP
codes to executables is left to proper OpenMP compilers.

In the following we first outline the OpenMP programming model and give an 
overview of CAPTools and CAPO for generating OpenMP programs. Then, in Sect. 3 
we discuss the implementation of CAPO. Case studies of using CAPO to parallelize 
the NAS Parallel Benchmarks and two computational fluid dynamics (CFD) applica- 
tions are presented in Sect. 4 and conclusions are given in the last section. 



2 Automatic Generation of OpenMP Directives 

2.1 The OpenMP Programming Model 

OpenMP [13] was designed to facilitate portable implementation of shared memory 
parallel programs. It includes a set of compiler directives and callable runtime library 
routines that extend Fortran, C and C++ to support shared memory parallelism. It
promises an incremental path for parallelizing sequential software, while also targeting
scalability and performance for complete rewrites or newly constructed
applications.

OpenMP follows the fork-and-join execution model. A fork-and-join program ini- 
tializes as a single lightweight process, called the master thread. The master thread 
executes sequentially until the first parallel construct (OMP PARALLEL) is encoun- 
tered. At that point, the master thread creates a team of threads, including itself as a 
member of the team, to concurrently execute the statements in the parallel construct. 
When a work-sharing construct such as a parallel do (OMP DO) is encountered, the 
workload is distributed among the members of the team. An implied synchronization 
occurs at the end of the DO loop unless a "NOWAIT" is specified. Data sharing of
variables is specified at the start of parallel or work-sharing constructs using the 
SHARED and PRIVATE clauses. In addition, reduction operations (such as summa- 
tion) can be specified by the REDUCTION clause. Upon completion of the parallel 
construct, the threads in the team synchronize and only the master thread continues 
execution. The fork-and-join process can be repeated many times in the course of
program execution. 
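The fork-and-join pattern with a work-sharing loop and a sum reduction can be imitated outside OpenMP. The Python sketch below is only an analogy: `static_chunks` and `parallel_sum_of_squares` are hypothetical helpers, the thread pool stands in for the team of threads, and, unlike in OpenMP, the calling thread does not itself execute a share of the loop. Python threads also do not speed up this kind of computation; the point is the control structure.

```python
from concurrent.futures import ThreadPoolExecutor


def static_chunks(n, nthreads):
    """Split range(n) into nthreads contiguous blocks, like a static schedule."""
    base, extra = divmod(n, nthreads)
    start = 0
    for t in range(nthreads):
        size = base + (1 if t < extra else 0)
        yield range(start, start + size)
        start += size


def parallel_sum_of_squares(n, nthreads=4):
    def body(chunk):
        partial = 0              # private to each worker
        for i in chunk:
            partial += i * i
        return partial

    # Fork: create the team and hand each member one block of iterations.
    with ThreadPoolExecutor(max_workers=nthreads) as team:
        partials = list(team.map(body, static_chunks(n, nthreads)))
    # Join: leaving the with-block waits for all workers, like the implied barrier.
    # Reduction: combine the private partial sums into the final result.
    return sum(partials)


total = parallel_sum_of_squares(100)
print(total)
```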

Beyond the inclusion of parallel constructs to distribute work to multiple threads, 
OpenMP introduces a powerful concept of orphan directives that greatly simplifies 
the task of implementing coarse grain parallel algorithms. Orphan directives are di- 
rectives outside the lexical extent of a parallel region. This allows the user to specify 
control or synchronization from anywhere inside the parallel region, not just from the 
lexically contained region. 



2.2 CAPTools 

The Computer-Aided Parallelization Tools (CAPTools) [4] is a software toolkit that 
was designed to automate the generation of message-passing parallel code. CAPTools 
accepts FORTRAN-77 serial code as input, performs extensive dependence analysis, 
and uses domain decomposition to exploit parallelism. The tool employs sophisticated 
algorithms to calculate execution control masks and minimize communication. The 
generated parallel codes contain a portable interface to message passing standards, such
as MPI and PVM, through a low-overhead library.

There are two important strengths that make CAPTools stand out. Firstly, an
extensive set of extensions [5] to the conventional dependence analysis techniques has
allowed CAPTools to obtain much more accurate dependence information and, thus,
produce more efficient parallel code. Secondly, the tool contains a set of browsers that
allow the user to inspect and assist parallelization at different stages.



2.3 Generating OpenMP Directives 

The goal of developing computer-aided tools to help parallelize applications is to let
the tools do as much as possible and minimize the amount of tedious and error-prone
work performed by the user. The key to automatic detection of parallelism in a program,
and thus to parallelization, is to obtain accurate data dependences in the program.
Generating OpenMP directives is somewhat simplified because we are now working in
a globally addressed space without explicitly considering data distribution. However,
there are always cases in which certain conditions could prevent tools from detecting
possible parallelization; thus, an interactive user environment is also important.

The design of the CAPTools-based automatic parallelizer with OpenMP, CAPO, 
kept the above considerations in mind. The schematic structure of CAPO is illustrated in
Fig. 1. The detailed implementation of the tool is given in Sect. 3. CAPO takes a 
serial code as input and uses the data dependence analysis engine in CAPTools. User 
knowledge on certain input parameters in the source code may be entered to assist this 
analysis for more accurate results. The process of exploiting loop level parallelism in a 
program and generating OpenMP directives automatically is summarized in the fol- 
lowing three stages. 
1) Identify parallel loops and parallel regions. The loop-level analysis classifies
loops as parallel (including reduction), serial or potential pipeline based on the data
dependence information. Parallel loops to be distributed with work-sharing directives
for parallel execution are identified by traversing the call graph of the program from
the top down. Only outer-most parallel loops are considered, partly due to the very
limited support of multi-level parallelization in available OpenMP compilers. Parallel
regions are then formed around the distributed parallel loops. An attempt is also made to
identify and create parallel pipelines. Details are given in Sects. 3.1-3.3.

2) Optimize loops and regions. This stage is mainly for reducing overhead caused
by fork-and-join and synchronization. A parallel region is first expanded as far as
possible and may include calls to subroutines that contain additional (orphaned)
parallel loops. Regions are then merged together if there is no violation of data usage in
doing so. Region expansion is currently limited to within a subroutine. Synchronization
optimization between loops in a parallel region is performed by checking if the
loops can be executed asynchronously. Details are given in Sects. 3.2 and 3.4.

3) Transform codes and insert directives. Variables in common blocks are analyzed for
their usage in all parallel regions in order to identify threadprivate common blocks. If a
private variable is used in a non-threadprivate common block, the variable is treated
specially. A routine needs to be duplicated if its usage conflicts at different calling
points. Details are given in Sects. 3.5-3.7.

By traversing the call graph one more time, OpenMP directives are lastly inserted for
parallel regions and parallel loops with variables properly listed. The variable usage
analysis is performed at several points to identify how variables are used (e.g.
private, shared, reduction, etc.) in a loop or region. Such analysis is required for the
identification of loop types, the construction of parallel regions, the treatment of
private variables in common blocks, and the insertion of directives.




Fig. 1. Schematic flow chart of the CAPO architecture 
Intermediate results can be stored into or retrieved from a database. User assistance
to the parallelization process is possible through browsers implemented in CAPO
(Directives Browser) and in CAPTools. The Directives Browser is designed to provide
more interactive information from the parallelization process, such as reasons
why loops are parallel or serial, distributed or not distributed. The user can concentrate
on areas where potential improvements could be made, for example, by removing
false data dependences. It is part of the iterative process of parallelization.



3 Implementation 

In the following subsections, we will give some implementation details of CAPO 
organized according to the components outlined in Sect. 2.3. 



3.1 Loop-Level Analysis 

In the loop-level analysis, the data dependence information is used to classify loops in 
each routine. Loop types include parallel (including reduction), serial, and pipeline. 
A parallel loop is a loop with no loop-carried data dependences and no exiting state- 
ments that jump out of the loop (e.g. RETURN). Loops with I/O (e.g. READ, WRITE) 
statements are excluded from consideration at this point. A parallel loop includes the
case where a variable is rewritten during the iteration of the loop but the variable can 
be privatized (i.e. having a local copy on each thread) to remove the loop-carried 
output dependence. 

A reduction loop is considered as a special parallel loop since the loop can first up- 
date partial results in parallel on each thread and then update the final result atomi- 
cally. The reduction operations, such as "+", "min", "max", etc. can be executed 
in a CRITICAL section. This is how the array reduction is implemented later on. 
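The two-phase scheme described above (private partial results first, then a guarded update of the shared result) can be sketched in Python, with a lock playing the role of the CRITICAL section. This illustrates the idea only and is not the code CAPO generates; `array_reduction` and its cyclic distribution of rows are choices made for this example.

```python
import threading


def array_reduction(rows, nthreads=4):
    """Column-wise sum of rows using per-thread partial arrays."""
    ncols = len(rows[0])
    result = [0.0] * ncols          # shared reduction array
    critical = threading.Lock()     # stands in for the CRITICAL section

    def worker(tid):
        local = [0.0] * ncols       # private partial result of this thread
        for row in rows[tid::nthreads]:   # cyclic distribution of iterations
            for j, value in enumerate(row):
                local[j] += value
        with critical:              # one thread at a time updates the shared array
            for j in range(ncols):
                result[j] += local[j]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


sums = array_reduction([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(sums)
```

Each thread enters the guarded section once, after its share of the loop, so the cost of serialization grows with the number of threads and the array length rather than with the iteration count.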

A special class of loops, called pipeline loops, has loop-carried true dependences
whose dependence vectors have determinable lengths and the same sign.
Such a loop can potentially be used to form a parallel pipeline with an outer loop
nest. Compiler techniques for finding pipeline parallelism through affine transforms
are discussed in [11]. The pipeline parallelism can be implemented with OpenMP
directives using point-to-point synchronization. This is discussed in Sect. 3.3.

A serial loop is a loop that cannot be run in parallel due to loop-carried data
dependences, I/O or exiting statements. However, a serial loop may be used for the
formation of a parallel pipeline.
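The classification rules of this subsection can be condensed into a small decision function. The Python sketch below is a deliberate simplification and not CAPO's implementation: the `Dependence` and `Loop` records are hypothetical, each loop-carried dependence is reduced to a single signed distance, and the flag `removable` summarizes whether privatization or reduction handling eliminates the dependence.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dependence:
    kind: str                  # "true", "anti" or "output"
    distance: Optional[int]    # loop-carried distance; None if not determinable
    removable: bool = False    # True if privatization or reduction removes it


@dataclass
class Loop:
    has_io: bool = False
    has_exit: bool = False
    carried: List[Dependence] = field(default_factory=list)


def classify(loop: Loop) -> str:
    """Return "parallel", "pipeline" or "serial" for one loop."""
    if loop.has_io or loop.has_exit:
        return "serial"
    remaining = [d for d in loop.carried if not d.removable]
    if not remaining:
        return "parallel"      # covers privatizable and reduction loops
    determinable_true = all(
        d.kind == "true" and d.distance is not None and d.distance != 0
        for d in remaining
    )
    if determinable_true and len({d.distance > 0 for d in remaining}) == 1:
        return "pipeline"      # true dependences, known lengths, same sign
    return "serial"


print(classify(Loop(carried=[Dependence("true", 1), Dependence("true", 2)])))
```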



3.2 Setup of Parallel Region 

In order to achieve good performance, it is not enough to simply stay with parallel 
loops at a finer grained level. In the context of OpenMP, it is possible to express 
coarser-grained parallelism with parallel regions. Our next task is to use the loop-level 
information to define these parallel regions. 

There are several steps in constructing parallel regions:

a) Traverse the call graph in a top-down approach and identify parallel loops to be 
distributed. Only outer-most parallel loops with enough granularity are consid- 
ered. Parallel regions are then formed around the distributed parallel loops, in- 
cluding pipeline loops if no parallel loops can be found at the same level and 
parallel pipelines can be formed. 

b) Expand each parallel region as much as possible in a routine to the top-most 
loop nest that contains no I/O and exiting statements and is not part of another 
parallel region. If a variable will be rewritten by multiple threads in the potential 
parallel region and cannot be privatized (the memory access conflict test), back 
down one loop nest level. A reduction loop is in a parallel region by itself.

c) Include in the region any preceding code blocks that satisfy the memory access
conflict test and are not yet included in other parallel regions. Orphaned
directives will be used in routines that are called inside a parallel region but
outside a distributed parallel loop.

d) Join two neighboring regions to form a larger parallel region if possible. 

e) Treat parallel pipelines across subroutine boundaries if needed (see next subsec- 
tion). 

3.3 Pipeline Setup 

A potential pipeline loop (as introduced in Sect. 3.1) can be identified by analyzing 
the dependence vectors in symbolic form. In order to set up a parallel pipeline, an 
outer loop nest is required. If the top loop in a potentially parallel region is a pipeline 
loop and the loop is also in the top-level loop nesting of the routine, then the loop is 
further checked for loop nest in the immediate parent routine. The loop in the parent 
routine can be used to form an "upper" level parallel pipeline only if all the following
tests are true: a) such a parent loop nest exists, b) each routine called inside the parent
loop contains only a single parallel region, and c) except for pipeline loops all
distributed parallel loops can run asynchronously (see Sect. 3.4). If any of the tests
fails, the pipeline loop will be treated as a serial loop.

OpenMP provides directives (e.g. "OMP FLUSH") and library functions to perform
point-to-point synchronization, which makes the implementation of pipeline
parallelism possible with directives. These directives and functions are used to ensure
that the pipelined code section is not executed in a thread before the work in the
neighboring thread is done. Such an execution requires the scheduling scheme for the
pipeline loop to be STATIC and ORDERED. Our implementation of parallel pipelines with
directives started from an example given in the OpenMP Application Program
Interface [13]. The pipeline algorithm is used for parallelizing the NAS benchmark LU in
Sect. 4.1 and is also described in [12].
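
The point-to-point synchronization behind a 1-D pipeline can be sketched as follows.
This is an illustration under stated assumptions, not the code CAPO generates: Python
threads and a condition variable stand in for the OpenMP FLUSH-based flags, and the
recurrence a(k,j) = a(k-1,j) + a(k,j-1) is a made-up example of a loop nest with
loop-carried true dependences of the same sign.

```python
import threading


def sequential_sweep(a):
    """Reference version: plain nested loops over rows k and columns j."""
    for k in range(1, len(a)):
        for j in range(1, len(a[0])):
            a[k][j] = a[k - 1][j] + a[k][j - 1]
    return a


def pipelined_sweep(a, num_threads):
    """Same update, with the j loop statically distributed and pipelined over k."""
    nk, nj = len(a), len(a[0])
    cols = list(range(1, nj))
    chunk = (len(cols) + num_threads - 1) // num_threads
    # Static block distribution: thread t owns a contiguous block of columns.
    blocks = [cols[t * chunk:(t + 1) * chunk] for t in range(num_threads)]

    done = [0] * num_threads          # done[t]: last row finished by thread t
    cond = threading.Condition()

    def worker(t):
        for k in range(1, nk):
            if t > 0:
                # Wait until the left neighbour has finished row k.
                with cond:
                    cond.wait_for(lambda: done[t - 1] >= k)
            for j in blocks[t]:
                a[k][j] = a[k - 1][j] + a[k][j - 1]
            with cond:
                done[t] = k           # signal the right neighbour
                cond.notify_all()

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return a
```

Each thread only waits for its left neighbor, so thread t can work on row k while
thread t-1 already works on row k+1. The ordering of threads by column block is what
the STATIC schedule requirement provides in the directive version.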

3.4 End-of-Loop Synchronization

Synchronization is used to ensure the correctness of program execution after a parallel 
construct (such as END PARALLEL or END DO). By default, synchronization is 
added at the end of a parallel loop. Sometimes the synchronization at the end of a loop
can be eliminated to reduce the overhead. We used a technique similar to the one 
in [15] to remove synchronization between two loops. 

To be able to execute two loops asynchronously and to avoid a thread synchronization
directive between them, we have to perform a number of tests in addition to using the
dependence information provided by CAPTools. The tests verify that a thread executing
a portion of the instructions of one loop will not read/write data read/written by
a different thread executing a portion of another loop. Hence, for each non-private
array we check that the set of written locations of the array by the first thread and the
set of read/written locations of the array by the second thread do not intersect. The
condition is known as the Bernstein condition (see [9]). If the Bernstein condition (BC)
is true, the loops can be executed asynchronously. The BC test is performed in two steps.
The final decision is made conservatively: if there is no proof that BC is true, it is set
to false. We assume that the same number of threads execute both loops, the number
of threads is larger than one and there is an array read/written in both loops.

Check the number of loop iterations. Since the number of threads can be arbitrary,
the number of iterations performed in each loop must be the same. If it cannot
be proved that the number of iterations is the same for both loops, the Bernstein
condition is set to false.

Compare array indices. For each reference to a non-privatizable array in the left 
hand side (LHS) of one loop and for each reference to the same array in another loop 
we compare the array indices. If we cannot prove for at least one dimension that the
indices in both references are different, then we set BC to false. The condition can be
relaxed if we assume the same thread schedule is used for both loops.
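
The Bernstein condition for one array can be illustrated by brute-force enumeration.
The sketch below is not the symbolic test used in CAPO; it assumes a 1-D array, a
static block schedule shared by both loops (the relaxed case mentioned above), and
index expressions given as Python functions of the loop counter. All names are
hypothetical.

```python
from typing import Callable, Iterable, List, Set


def static_chunks(n_iter: int, n_threads: int) -> List[range]:
    """Split iterations 0..n_iter-1 into contiguous blocks, one per thread."""
    chunk = (n_iter + n_threads - 1) // n_threads
    return [range(min(t * chunk, n_iter), min((t + 1) * chunk, n_iter))
            for t in range(n_threads)]


def touched(indices: Iterable[Callable[[int], int]], iters: range) -> Set[int]:
    """Array elements referenced by the index functions over the iterations."""
    return {f(i) for f in indices for i in iters}


def can_run_asynchronously(n1, writes1, reads1, n2, writes2, reads2, n_threads):
    """True if no thread in loop 2 conflicts with a different thread in loop 1."""
    if n1 != n2:                      # step 1: iteration counts must match
        return False
    chunks = static_chunks(n1, n_threads)
    for t1 in range(n_threads):       # step 2: compare referenced elements
        w1 = touched(writes1, chunks[t1])
        r1 = touched(reads1, chunks[t1])
        for t2 in range(n_threads):
            if t1 == t2:
                continue              # same thread: program order is kept
            w2 = touched(writes2, chunks[t2])
            r2 = touched(reads2, chunks[t2])
            if w1 & (r2 | w2) or r1 & w2:
                return False
    return True
```

For instance, a loop writing A(I) followed by a loop reading A(I) passes the test,
whereas a second loop reading A(I+1) fails it, because the element at a block boundary
is written by one thread and read by its neighbor.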



3.5 Variable Usage Analysis 

Properly identifying variable usage is very important for the parallel performance and 
the correctness of program execution. Variables that would cause memory access 
conflict among threads need to be privatized so that each thread will work on a local 
copy. For cases where privatization is not possible, for instance when an array variable
would be only partially updated by each thread, the variable should be treated as shared
and the work in the loop or region can only be executed sequentially (except for the
reduction operation). Private variables are identified by examining the data
dependence information, in particular, output dependences for memory access conflicts
and true dependences for value assignment. Partial updating of variables is checked by
examining array index expressions.

With OpenMP, if a private variable needs its initial value from outside a parallel 
region, the FIRSTPRIVATE clause can be used to obtain an initial copy of the
original value; if a private variable is used after a parallel region, the LASTPRIVATE
clause can be used to update the shared variable.

The reduction operation is commonly encountered in calculations. In a typical
implementation of parallel reduction, a private copy of each reduction variable is first
created for each thread, the local value is calculated on each thread, and the global copy
is then updated according to the reduction operator. OpenMP 1.0 only supports reductions
for scalar values. For arrays, we first transform the code section to create a local array
and then update the global copy in a CRITICAL section.
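
The array reduction scheme just described can be sketched as follows. This is an
illustrative Python version, not generated code: a lock plays the role of the CRITICAL
section, and the function name and the cyclic split of iterations are assumptions made
for the example.

```python
import threading


def parallel_array_sum(rows, n_bins, n_threads):
    """Sum the rows of a 2-D list into a single array of length n_bins."""
    total = [0] * n_bins              # shared (global) reduction array
    lock = threading.Lock()           # stands in for the CRITICAL section

    def worker(t):
        local = [0] * n_bins          # private copy, initialized for "+"
        for row in rows[t::n_threads]:    # this thread's share of iterations
            for b in range(n_bins):
                local[b] += row[b]
        with lock:                    # one thread at a time updates the global copy
            for b in range(n_bins):
                total[b] += local[b]

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return total
```

The critical section is entered once per thread rather than once per iteration, which
is why the extra local array is worth its memory cost.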



3.6 Private Variables in Common Blocks 

For a private variable, each thread keeps a local copy and the original shared variable 
is untouched during the course of updating the local copy. If a variable declared in a 
common block is private in a loop, changes made to the variable through a subroutine 
call may not be updated properly for the local copy of this variable. If all the variables 
in the common block are privatizable in the whole program, the common block can be 
declared as threadprivate. However, if the common block cannot be made
threadprivate, additional care is needed to treat the private variable.

The following algorithm is used to treat private variables in a common block. The 
algorithm identifies and performs the necessary code transformation to ensure the 
correctness of variable privatization. The following convention is used: R_INSIDE 
for a routine called inside a parallel loop, R_OUTSIDE for a routine called outside a
parallel loop, R_CALL for the routine in a call statement, R_CALLBY for a routine that
calls the current routine, and V (or VC, VD, VN) for a variable named in a routine.

TreatPrivate(V, R_ORIG, callstatement) {
   check V usage in callstatement
   if V is not used in the call (via dependences) || is on the command parse tree
      || is not defined in a regular common block in a subroutine along the call path
      return

   TreatVinCall(VC, R_CALL) (V is referred as VC in R_CALL) {
      if VC is in the argument list of R_CALL
         return VC
      if R_CALL is R_OUTSIDE {
         if VC is not declared in R_CALL {
            replicate the common block in which V is named as VN
            set VC to VN from the common block
            set V to VN in the private variable list if R_CALL==R_ORIG
         }
      } else {
         add VC to the argument list of R_CALL
         if VC is defined in a common block of R_CALL
            add R_CALL & VC to RenList (for variable renaming later on)
         else declare VC in R_CALL
         for each calledby statement of R_CALL {
            VD is the name of VC used in R_CALLBY
            TreatVinCall(VD, R_CALLBY) and set VD to the returned value
            add VD to the argument of call statement to R_CALL in R_CALLBY
         }
         for each call statement in R_CALL
            TreatPrivate(VC, R_CALL, callstatement_in_R_CALL)
      }
      return VC
   }
}

Rename common block variables listed in RenList.

The algorithm starts with private variables listed for a parallel region in routine 
R_ORIG, one variable at a time. It is used recursively for each call statement in the 
parallel region along the call graph. A list of routine-variable pairs (R_CALL,VC) is 
stored in RenList during the process to track where private variables appear in 
common blocks. These variables in common blocks are renamed at the end. 

As an example in Fig. 2 the private array B is assigned inside subroutine SUB via 
the common block /CSUB/ in loop S2. Applying the above algorithm, the private 
variable B is added as C to the argument list of SUB and the original variable C in the 
common block in SUB is renamed to C_CAP to avoid usage conflict. In this way the 
local copy of B inside loop S2 will be updated properly in subroutine SUB. 



Original code:

S1        common /csub/ b(100),
         &       a(100,100)
S2        do 10 j=1, ny
S3          call sub(j, nx)
            do 10 i=1, nx
              a(i,j) = b(i)
   10     continue

S4        subroutine sub(j, nx)
S5        common /csub/ c(100),
         &       a(100,100)
          c(1) = a(1,j)
          c(nx) = a(nx,j)
          do 20 i=2, nx-1
            c(i) = (a(i+1,j) +
         &          a(i-1,j))*0.5
   20     continue
          end

Transformed code:

S1        COMMON/CSUB/B(100),A(100,100)
    !$OMP PARALLEL DO PRIVATE(I,J,B)
S2        DO 10 J=1, NY
S3          CALL SUB(J, NX, B)
            DO 10 I=1, NX
              A(I,J) = B(I)
   10     CONTINUE
    !$OMP END PARALLEL DO

S4        SUBROUTINE SUB(J, NX, C)
S5        COMMON/CSUB/C_CAP(100),A(100,100)
S6        DIMENSION C(100)
          C(1) = A(1,J)
          C(NX) = A(NX,J)
          DO 20 I=2, NX-1
            C(I) = (A(I+1,J)+A(I-1,J))*0.5
   20     CONTINUE
          END

Fig. 2. An example of treating a private variable in a common block 



3.7 Routine Duplication 

Routine duplication is performed after all the analyses are done but before directives 
are inserted. A routine needs to be duplicated if it causes usage conflicts at different 
calling points. For example, if a routine contains parallel regions and is called both 
inside and outside other parallel regions, the routine is duplicated so that the original 
routine is used outside parallel regions and the second copy contains only orphaned
directives without "OMP PARALLEL" and is used inside parallel regions. Routine
duplication is often used in a message-passing program to handle different data
distributions in the same routine.

4 Case Studies

We have applied CAPO to parallelize the NAS parallel benchmarks and two compu- 
tational fluid dynamics (CFD) codes well known in the aerospace field: ARC3D and 
OVERFLOW. The parallelization started with the interprocedural data dependence 
analysis on sequential codes. This step was the most computationally intensive part. 
The result was saved to an application database for later use. The loop and region 
level analysis was then carried out. At this point, the user inspects the result and de- 
cides if any changes are needed. The user assists the analysis by providing additional 
information on input parameters and removing any false dependences that could not 
be resolved by the tool. This is an iterative process, with user interaction involved. As
we will see in the examples, the user interaction is nominal. Finally, OpenMP directives
were inserted automatically.

In the case studies, we used an SGI workstation (R5K, 150 MHz) and a Sun E10K
node to run CAPO. The resulting OpenMP codes were tested on an SGI Origin2000
system, which consisted of 64 CPUs and 16 GB globally addressable memory. Each
CPU in the system is an R10K 195 MHz processor with 32KB primary data cache and
4MB secondary data cache. SGI’s MIPSpro Fortran 77 compiler (7.2.1) was used
for compilation with the "-O3 -mp" flag.



4.1 The NAS Parallel Benchmarks 

The NAS Parallel Benchmarks (NPB) were designed to compare the performance of 
parallel computers and are widely recognized as a standard indicator of computer 
performance. The NPB suite consists of five kernels and three simulated CFD appli- 
cations derived from important classes of aerophysics applications. The five kernels 
mimic the computational core of five numerical methods used by CFD applications. 
The simulated CFD applications reproduce much of the data movement and computa- 
tion found in full CFD codes. Details of the benchmark specifications can be found in 
[2] and the MPI implementations of NPB are described in [3]. 

In this study we used six benchmarks (LU, SP, BT, FT, MG and CG) from the se- 
quential version of NPB2.3 [3] with additional optimization described in [7]. Paralle- 
lization of the benchmarks with CAPO is straightforward except for FT where addi- 
tional user interaction was needed. User knowledge on the grid size (> 6) was entered 
for the data dependence analysis of BT, SP and LU. In all cases, the parallelization 
process for each benchmark took from tens of minutes up to one hour, most of the 
time being spent in the data dependence analysis. The performance of the
CAPO-generated codes is summarized in Fig. 3 together with a comparison to other
parallel versions of NPB: MPI from NPB2.3, hand-coded OpenMP [7], and versions
generated with the commercial tool SGI-PFA [17].

CAPO was able to locate effective parallelization at the outer-most loop level for 
the three application benchmarks and automatically pipelined the SSOR algorithm in 
LU. As shown in Fig. 3, the performance of CAPO-BT, SP and LU is within 10% of
the hand-coded OpenMP version and much better than the results from SGI-PFA. The
SGI-PFA curves represent results from the parallel version generated by SGI-PFA
without any change for SP and with user optimization for BT (see [17] for details).
The worse performance of SGI-PFA indicates the importance of accurate
interprocedural dependence analysis, which usually cannot be emphasized in a compiler. It
should be pointed out that the sequential version used in the SGI-PFA study was not
optimized; thus, the sequential performance needs to be taken into account in the
comparison. The hand-coded MPI versions scaled better, especially for LU. We attribute
the performance degradation in the directive implementation of LU to less data locality
and larger synchronization overhead in the 1-D pipeline used in the OpenMP version as
compared to the 2-D pipeline used in the MPI version. This is consistent with the
result of a study from [12].




Fig. 3. Comparison of the OpenMP NPB generated by CAPO with other parallel versions: MPI 
from NPB2.3, OpenMP by hand, and SGI-PFA 

The basic loop structure for the Fast Fourier Transform (FFT) in one dimension in 
FT is as follows. 

      DO 10 K=1,D3
      DO 10 J=1,D2
         DO 20 I=1,D1
 20      Y(I) = X(I,J,K)
         CALL CFFTZ(...,Y)
         DO 30 I=1,D1
 30      X(I,J,K) = Y(I)
 10   CONTINUE

A slice of the 3-D data (X) is first copied to a 1-D work array (Y). The 1-D FFT rou- 
tine CFFTZ is called to work on Y. The returned result in Y is then copied back to the 
3-D array (X). Due to the complicated pattern of loop limits inside CFFTZ, CAPTools 
could not disprove the loop-carried true dependences caused by the work array Y for loop
K. These dependences were deleted by hand in CAPO to identify the K loop as a 
parallel loop. 
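
Why the K loop is safe to parallelize once the dependence on Y is discarded can be seen
in a small sketch. It is an illustration only: transform_1d is a made-up stand-in for
CFFTZ (a running sum rather than an FFT), the array is indexed as x[k][j][i], and Python
threads replace the OpenMP parallel loop with Y declared private.

```python
from concurrent.futures import ThreadPoolExecutor


def transform_1d(y):
    """Stand-in for CFFTZ: any in-place 1-D transform (here a running sum)."""
    for i in range(1, len(y)):
        y[i] += y[i - 1]


def sweep_serial(x):
    """Original structure: one work array Y reused by every (K, J) iteration."""
    d3, d2, d1 = len(x), len(x[0]), len(x[0][0])
    y = [0] * d1
    for k in range(d3):
        for j in range(d2):
            for i in range(d1):
                y[i] = x[k][j][i]
            transform_1d(y)
            for i in range(d1):
                x[k][j][i] = y[i]
    return x


def sweep_parallel(x, n_threads=4):
    """K loop run in parallel; every K iteration uses its own private Y."""
    d2, d1 = len(x[0]), len(x[0][0])

    def do_k(k):
        y = [0] * d1                  # private copy of the work array
        for j in range(d2):
            for i in range(d1):
                y[i] = x[k][j][i]
            transform_1d(y)
            for i in range(d1):
                x[k][j][i] = y[i]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(do_k, range(len(x))))
    return x
```

Because Y is completely overwritten before it is read in each (K, J) iteration, no value
flows from one K iteration to the next, and both versions give the same result.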

The resulting parallel FT code gave reasonable performance, as indicated by the
curve with filled circles in Fig. 3. It does not scale as well as the hand-coded versions
(both in MPI and OpenMP), mainly due to the unparallelized code section for the
matrix creation, which was done artificially with random number generators. In the
hand-coded version, this code section was restructured to parallelize the matrix
creation. Again, the SGI-PFA generated code performed worse.

The directive code generated by CAPO for MG performs 36% worse on 32 processors
than the hand-coded version, primarily due to an unparallelized loop in routine
norm2u3. The loop contains two reduction operations of different types. One of the
reductions was expressed in an IF statement, which was not detected by CAPO; thus,
the routine was run serially. Although this routine takes only about 2% of the total
execution time on a single node, it translates into a large portion of the parallel
execution time on a large number of processors, for example, 40% on 32 processors. All
the parallel versions achieved similar results for CG.
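
The 2% and 40% figures are consistent with a simple Amdahl-type estimate. The sketch
below assumes, as an idealization, that everything except the serial routine scales
perfectly with the number of processors; the function name is hypothetical.

```python
def serial_share(serial_fraction, n_procs):
    """Share of the parallel run time spent in a section that stays serial.

    serial_fraction is the section's share of the single-processor run time;
    the remaining work is assumed to speed up ideally on n_procs processors.
    """
    parallel_time = (1.0 - serial_fraction) / n_procs
    return serial_fraction / (serial_fraction + parallel_time)
```

With serial_fraction = 0.02 and n_procs = 32 the estimate is 0.02 / (0.02 + 0.98/32),
or roughly 0.395, which matches the figure of about 40% quoted above.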



4.2 ARC3D 

ARC3D is a moderate-size CFD application. It solves Euler and Navier-Stokes
equations in three dimensions using a single rectilinear grid. ARC3D has a structure similar
to NPB-SP but contains curvilinear coordinates, turbulence models and more realistic
boundary conditions. The Beam-Warming algorithm is used to approximately factorize
an implicit scheme of finite difference equations, which is then solved in three
directions successively.

For generating the OpenMP parallel version of ARC3D, we used a serial code that 
was already optimized for cache performance by hand [16]. The parallelization
process with CAPO was straightforward and OpenMP directives were inserted without
further user interaction. The parallel version was tested on the Origin2000 and the
result for a 194x194x194-size problem is shown in the left panel of Fig. 4. The results
from a hand-parallelized version with SGI multi-tasking directives (MT by hand) [16]
and a message-passing version generated by CAPTools (CAP MPI) [8] from the same
serial version are also included in the figure for comparison.

As one can see from the figure, the OpenMP version generated by CAPO is essen- 
tially the same as the hand-coded version in performance. This is indicative of the 
accurate data dependence analysis and sufficient parallelism that was exploited in the 
outer-most loop level. The MPI version is about 10% worse than the directive-based 
versions. The MPI version uses extra buffers for communication and this could con- 
tribute to the increase of execution time. 



4.3 Overflow 

OVERFLOW is widely used for airflow simulation in the aerospace community. It 
solves compressible Navier-Stokes equations with first-order implicit time scheme, 
complicated turbulence model and Chimera boundary condition in multiple zones. The 
code has been parallelized by hand [6] with several approaches: PVM for zonal-level 
parallelization only, MPI for both inter- and intra-zone parallelization, multi-tasking 
directives, and multi-level parallelization. This code offers a good test case for our 
tool not only because of its complexity but also its size (about 100K lines of
FORTRAN 77).

In this study, we used the sequential version (1.8f) of OVERFLOW. CAPO took 25
hours on a Sun E10K node to complete the data dependence analysis. A fair amount of
effort was spent on pruning data dependences that were assumed because of a lack of
necessary knowledge during the dependence analysis. An example of a false dependence is
illustrated in the following code segment:

      NTMP2 = JD*KD*31
      DO 100 L=LS,LE
         CALL GETARX(NTMP2,TMP2,ITMP2)
         CALL WORK(L,TMP2(ITMP2,1),TMP2(ITMP2,7),...)
         CALL FREARX(NTMP2,TMP2,ITMP2)
 100  CONTINUE

Inside the loop nest, the memory space for an array TMP2 is first allocated by
GETARX. The working array is then used in WORK and freed afterwards. However, the
dependence analysis has revealed that the loop contains loop-carried true
dependences caused by variable TMP2; thus, the loop can only be executed in serial. The
memory allocation and de-allocation are performed dynamically and cannot be
handled by CAPO. This kind of false dependence can safely be removed with the
Directives Browser included in the tool, which gives the user an easy way to
interact with the parallelization process. The OpenMP version was generated within a
day after the analysis was completed and an additional few days were used to test the
code.

The right panel of Fig. 4 shows the execution time per time-iteration of the
CAPO-OMP version compared with the hand-coded MPI version and hand-coded directive
(MT) version. All three versions were run with a test case of size 69x61x50, 210K grid
points in a single zone. Although the scaling is not quite linear (when compared to
ARC3D), especially for more than 16 processors, the CAPO version out-performed both
hand-coded versions. The MPI version contains sizable extra code [6] to handle
intra-zone data distributions and communications, so it is not surprising that its
overhead is large. However, the MPI version is catching up with the CAPO-OMP version
on a large number of processors. On the other hand, further review has indicated that
the multi-tasking version used a parallelization strategy fairly similar to that of CAPO,
but in quite a number of small routines the MT version did not place any directives, in
the hope that the compiler (SGI-PFA in this case) would automatically parallelize loops
inside these routines. The performance numbers seem to indicate otherwise.

We also tested with a large problem of 1.5M grid points. The result is not
included in the figure, but CAPO’s version achieved an 18-fold speedup on 32
processors of the Origin2000 (compared with 10-fold on 32 processors for the small
test case). It is not surprising that the problem with the larger grid size achieved
better parallel performance.




[Fig. 4, right panel: OVERFLOW, 69x61x50; curves: CAPO-OMP, MT by hand, MPI by hand]


Fig. 4. Comparison of execution times of CAPO generated parallel 
codes with hand-coded parallel versions for two CFD applications: 
ARC3D on the left and OVERFLOW on the right 

5 Related Work

There are a number of tools developed for code parallelization on both distributed and
shared memory systems. The KAPro-toolkit [10] from Kuck and Associates, Inc. (KAI)
performs data dependence analysis and automatically inserts OpenMP directives to a
certain degree. KAI has also developed several useful tools to ensure the correctness
of directive insertion and to help the user profile parallel codes. The SUIF compilation
system [18] from Stanford is a research product that is targeted at parallel code
optimization for shared-memory systems at the compiler level.

SGI’s MIPSpro compiler includes a tool, PFA, that tries to automatically
detect loop-level parallelism, insert compiler directives and transform loops to enhance
their performance. SGI-PFA is available on the Origin2000. Due to the constraints on
compilation time, the tool usually cannot perform a comprehensive dependence
analysis; thus, the performance of generated parallel programs is very limited. User
intervention with directives is usually necessary for better performance. For this purpose,
Parallel Analyzer View (PAV), which annotates the results of the dependence analysis of
PFA and presents them graphically, can be used to help the user insert directives manually.
More details of a study with SGI-PFA can be found in [17].

VAST/Parallel [14] from Pacific-Sierra Research is an automatic parallelizing pre- 
processor. The tool performs data dependence analysis for loop nests and supports the 
generation of OpenMP directives. 

Parallelization tools like FORGExplorer [1] and CAPTools [4] emphasize the gen- 
eration of message passing parallel codes for distributed memory systems. These tools 
can easily be extended to handle parallel codes in the shared-memory arena. Our work 
is such an example. As discussed in previous sections, the key to the success of our 
tool is the ability to obtain accurate data dependences combined with user guidance. 
An ability to handle large applications is also important. 



6 Conclusion and Future Work 

In summary, we have developed the tool CAPO that automatically generates
directive-based parallel programs for shared memory machines. The tool has been
successfully used to parallelize the NAS parallel benchmarks and several CFD
applications, as summarized in Table 1, which also includes information for another
CFD code, INS3D, to which the tool was applied.

By taking advantage of the intensive data dependence analysis from CAPTools, 
CAPO has been able to produce parallel programs with performance close to hand- 
coded versions in a relatively short period of time. It should be pointed out, however, 
that the results did not show the effort in cache optimization of the serial code, such as 
for ARC3D. Our approach is different from parallel compilers in that it spends much 
of its time on whole program analysis to discover accurate dependence information. 
The generated parallel code is produced using a source-to-source transformation with 
very little modification to the original code and, therefore, is easily maintainable. 

Table 1. Summary of CAPO applied on six NPBs and three CFD applications

Application         BT,SP,LU      FT,CG,MG         ARC3D        OVERFLOW         INS3D

Code Size           ~3000 lines   ~2000 lines      ~4000 lines  851 routines     256 routines
                    benchmark     benchmark                     100K lines       41K lines

Code Analysis a)    30 mins to    10 mins to       40 mins      25 hours         42 hours
                    1 hour        30 mins

Code Generation b)  < 5 mins      < 5 mins         < 5 mins     1 day            2 days

Testing c)          1 day         1 day            1 day        3 days           3 days

Performance         within 5-10%  within 10%       within 6%    slightly better  no hand-coded
Compared to                       for CG,                       (see text in     parallel version
Hand-coded                        30-36% for                    Sect. 4.3)
Version                           FT, MG


a) "Code Analysis" refers to wall-clock time spent on the data dependence analysis, for
NPB and ARC3D on an SGI Indy workstation and for OVERFLOW and INS3D on a
Sun E10K node.

b) "Code Generation" includes the time the user spent on interacting with the tool and on
code restructuring by hand (only for INS3D, in four routines). The restructuring involves
mostly loop interchange and loop fusion that cannot be done by the tool.

c) "Testing" includes debugging and running a code and collecting results.



For larger and more complex applications such as OVERFLOW, it is our experi- 
ence that the tool will not be able to generate efficient parallel codes without any user 
interactions. The importance of a tool, however, is its ability to quickly pinpoint the 
problematic codes in this case. CAPO (via Directives Browser) was able to point out a 
small percentage of code sections where user interactions were required for the test 
cases. 

Future work will focus on the following areas:

• Include a performance model for optimal placement of directives. 

• Apply data distribution directives (such as those defined by SGI) rather than
relying on the automatic data placement policy, First-Touch, by the operating
system to improve data layout and minimize the number of costly remote memory
references.

• Develop a methodology to work in a hybrid approach to handle parallel applica- 
tions in a heterogeneous environment or a cluster of SMP’s. Exploiting multi- 
level parallelism is important. 

• Develop an integrated working environment for sequential optimization, code 
transformation, code parallelization, and performance analysis. 



Abstract. This paper describes automatic coarse grain parallel processing
on a shared memory multiprocessor system using a newly developed
OpenMP backend of the OSCAR multigrain parallelizing compiler, which targets
systems ranging from a single chip multiprocessor to a high performance
multiprocessor and a heterogeneous supercomputer cluster. OSCAR multigrain parallelizing
compiler exploits coarse grain task parallelism and near fine grain paral- 
lelism in addition to traditional loop parallelism. The OpenMP backend 
generates parallelized Fortran code with OpenMP directives based on an- 
alyzed multigrain parallelism by middle path of OSCAR compiler from 
an ordinary Fortran source program. The performance of multigrain par- 
allel processing function by OpenMP backend is evaluated on an off the 
shelf eight processor SMP machine, IBM RS6000. The evaluation shows 
that the multigrain parallel processing gives us more than 2 times speed 
up compared with a commercial loop parallelizing compiler, IBM XL 
Fortran compiler, on the SMP machine. 



1 Introduction 

Automatic parallelizing compilers have been getting more important with the 
increase of parallel processing in a high performance multiprocessor system and 
use of multiprocessor architecture inside a single chip and for an upcoming home 
server for improving effective performance, cost-performance and ease of use. 
Current parallelizing compilers exploit loop parallelism, such as Do-all and
Do-across [33,3]. In these compilers, Do-loops are parallelized using various data
dependency analysis techniques [4,25] such as GCD, Banerjee’s inexact and exact
tests [33,3], OMEGA test[28], symbolic analysis[9], semantic analysis and dy- 
namic dependence test and program restructuring techniques such as array priva- 
tization[31], loop distribution, loop fusion, strip mining and loop interchange [32, 
23]. 

For example, Polaris compiler[26, 6, 29] exploits loop parallelism by using 
inline expansion of subroutine, symbolic propagation, array privatization [31,6] 
and run-time data dependence analysis[29]. PROMIS compiler[27, 5] combines
Parafrase2 compiler [24] using HTG[8] and symbolic analysis techniques [9], and
EVE compiler for fine grain parallel processing. SUIF compiler parallelizes loops
by using inter-procedure analysis [10, 11, 1], unimodular transformation and data
locality optimization[20, 2].

Effective optimization of data localization is more and more important because
of the increasing speed gap between memories and processors. Also, much
research on data locality optimization using program restructuring techniques
such as blocking, tiling, padding and data localization is in progress for high
performance computing and single chip multiprocessor systems [20,12,30,35].

The OSCAR compiler has realized multigrain parallel processing [7,22,19] that
effectively combines coarse grain task parallel processing [7,22,19,16,13,15],
which can be applied from a single chip multiprocessor to HPC multiprocessor
systems, loop parallelization and near fine grain parallel processing [17]. In
the conventional OSCAR compiler with the backend for the OSCAR architecture,
coarse grain tasks are dynamically scheduled onto processors or processor
clusters by compiler-generated code to cope with runtime uncertainties. As
task schedulers, the dynamic scheduler in the OSCAR Fortran compiler and a
distributed dynamic scheduler [21] have been proposed.

This paper describes the implementation scheme of thread level coarse grain
parallel processing on a commercially available SMP machine and its
performance. Ordinary sequential Fortran programs are automatically
parallelized by the OSCAR compiler with a newly developed OpenMP backend,
which generates a parallelized program with OpenMP directives. In other words,
the OSCAR Fortran compiler is used as a preprocessor that transforms a Fortran
program into parallelized OpenMP Fortran. In this scheme, parallel threads are
forked only once at the beginning of the program and joined only once at the
end to minimize fork/join overhead. Also, the OSCAR OpenMP backend realizes
hierarchical coarse grain parallel processing using only ordinary OpenMP
directives, whereas the NANOS compiler uses a custom-made n-thread library [34].

The rest of this paper is organized as follows. Section 2 introduces the
execution model of thread level coarse grain task parallel processing. Section
3 describes coarse grain parallelization in the OSCAR compiler. Section 4
describes the implementation of multigrain parallel processing in the OpenMP
backend. Section 5 evaluates the performance of this method on an IBM RS6000
SP 604e High Node for several programs from the Perfect Benchmarks and SPEC
95fp Benchmarks.



2 Execution Model of Coarse Grain Task Parallel 
Processing in OSCAR OpenMP Backend 

This section describes coarse grain task parallel processing using OpenMP
directives. Coarse grain task parallel processing uses parallelism among three
kinds of macro-tasks (MTs), namely Basic Block (BB), Repetition Block (RB) and
Subroutine Block (SB), described in Section 3. Macro-tasks are generated by
decomposition of a source program, assigned to threads or thread groups, and
executed in parallel.

In coarse grain task parallel processing using the OSCAR OpenMP backend,
threads are generated only once at the beginning of the program and joined
only once at the end. In other words, the OSCAR OpenMP backend realizes
hierarchical coarse grain parallel processing without hierarchical child
thread generation. For example, in Fig. 1, four threads are generated at the
beginning of the program, and all generated threads are grouped into one
thread group (group0). Thread group0 executes MT1, MT2 and MT3.



Fig. 1. Execution image (the figure shows thread group0; a legend symbol
represents thread grouping)



When a thread group executes an MT, the threads in the group use the
parallelism inside the MT. For example, if the MT is a parallelizable loop,
the threads in the group use parallelism among loop iterations. In Fig. 1, a
parallelizable loop MT2 is distributed to the four threads in the group.
Nested parallelism among sub-MTs, which are generated by decomposing the body
of an MT, is also used. Sub-MTs are assigned to nested (lower level) thread
groups that are hierarchically defined inside an upper level thread group. For
example, MT3 in Fig. 1 is decomposed into sub-MTs (MT3_1, MT3_2 and MT3_3),
and the sub-MTs are executed by two nested thread groups, namely group0_0 and
group0_1, each of which has two threads. These groups are defined inside
thread group0, which executes MT3.
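
The grouping described for Fig. 1 can be written down as a fixed mapping from
a thread number to its nested group indices. The following Python sketch is
illustrative only: the names `group_path`, `group_name` and the `group_sizes`
parameter are assumptions made for this example and are not part of the OSCAR
compiler, which fixes the grouping at compile time.

```python
def group_path(thread_id, group_sizes):
    """Return the nested group indices of one thread, outermost level first.

    group_sizes[k] is the number of threads per group at nesting level k
    (hypothetical parameter). Each size is assumed to divide the size of
    the level above it.
    """
    path = []
    local_id = thread_id
    outer_size = None
    for size in group_sizes:
        if outer_size is not None:
            local_id %= outer_size      # position inside the enclosing group
        path.append(local_id // size)
        outer_size = size
    return path


def group_name(path):
    """Format a path such as [0, 1] in the style used in the text: group0_1."""
    return "group" + "_".join(str(index) for index in path)


if __name__ == "__main__":
    # Four threads, one upper level group of four, nested groups of two.
    for thread_id in range(4):
        print(thread_id, group_name(group_path(thread_id, [4, 2])))
```

With four threads and group sizes 4 and 2, threads 0 and 1 fall into group0_0
and threads 2 and 3 into group0_1, which matches the two nested groups that
execute the sub-MTs of MT3.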

3 Coarse Grain Parallelization in OSCAR Compiler 

This section describes the analysis performed by the OSCAR compiler for coarse
grain task parallel processing. First, the OSCAR compiler defines coarse grain
macro-tasks from the source program and analyzes parallelism among the
macro-tasks. Next, the generated MTs are scheduled to thread groups statically
at compile time or dynamically by embedded scheduling code generated by the
compiler.

Fig. 2. Macro Flow Graph and Macro Task Graph



3.1 Definition of Coarse Grain Task 



In coarse grain task parallel processing, a source program is decomposed into
three kinds of MTs, namely BB, RB and SB, as mentioned above. Generated MTs
are assigned to thread groups and executed in parallel by the threads in the
thread group.

If a generated RB is a parallelizable loop, its parallel iterations are
distributed onto the threads inside the thread group, taking cache size into
account.

If an RB is a sequential loop with a large processing cost, or if the MT is an
SB, it is decomposed into sub-macro-tasks and processed hierarchically by the
coarse grain task parallel processing scheme, like MT3 in Fig. 1.

3.2 Generation of Macro-Flow Graph

After the macro-tasks are generated, the data dependencies and control flow
among MTs in each layer are analyzed hierarchically and represented by a
Macro-Flow Graph (MFG), as shown in Fig. 2(a).

In Fig. 2, nodes represent MTs, solid edges represent data dependencies among
MTs, and dotted edges represent control flow. A small circle inside a node
represents a conditional branch inside the MT. Although the arrows of edges
are omitted in the MFG, the edges are assumed to be directed downward.



3.3 Generation of Macro-Task Graph 

To extract parallelism among MTs from the MFG, Earliest Executable Condition
analysis, which considers data dependencies and control dependencies, is
applied. The Earliest Executable Condition represents the conditions under
which an MT may begin its execution earliest. It is obtained assuming the
following conditions.

1. If MTi data-depends on MTj, MTi cannot begin execution before MTj
finishes execution.

2. If the branch direction of MTj is determined, MTi, which control-depends on
MTj, can begin execution even though MTj has not completed its execution.

Then, the original form of the Earliest Executable Condition is represented as
follows:

(MTj, on which MTi is control dependent, branches to MTi) AND
(MTk (0 ≤ k ≤ |N|), on which MTi is data dependent, completes execution OR
it is determined that MTk is not to be executed), where N is the number of
predecessors of MTi

For example, the original form of the Earliest Executable Condition of MT6 in
Fig. 2(b) is

(MT1 branches to MT3 OR MT2 branches to MT4) AND
(MT3 completes execution OR MT1 branches to MT4).

However, the completion of MT3 means that MT1 has already branched to MT3.
Also, "MT2 branches to MT4" means that MT1 has already branched to MT2.
Therefore, this condition is redundant and its simplest form is

(MT3 completes execution OR MT2 branches to MT4).

The Earliest Executable Condition of each MT is represented in a Macro-Task
Graph (MTG) as shown in Fig. 2(b).
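
The simplification for MT6 can be checked mechanically by enumerating truth
values. The Python sketch below is a reading of the argument rather than the
compiler's algorithm. It uses four Boolean events: `a` (MT1 branches to MT3),
`b` (MT2 branches to MT4), `c` (MT3 completes execution) and `d` (the second
term of the data dependence clause, read here as "it is determined that MT3
is not executed"). The three consistency rules are assumptions taken from the
explanation in the text: completion of MT3 implies `a`, `b` implies that MT1
branched away from MT3 and hence `d`, and `a` and `d` exclude each other.

```python
from itertools import product


def original_eec(a, b, c, d):
    """Original form: (control clause) AND (data dependence clause)."""
    return (a or b) and (c or d)


def simplified_eec(b, c):
    """Simplest form: MT3 completes execution OR MT2 branches to MT4."""
    return c or b


def consistent(a, b, c, d):
    """Assumed relations among the events (see the lead-in)."""
    completion_implies_branch = (not c) or a   # c implies a
    branch_implies_skip = (not b) or d         # b implies d
    exclusive = not (a and d)                  # MT1 takes one direction only
    return completion_implies_branch and branch_implies_skip and exclusive


def forms_agree():
    """True if both forms give the same value in every consistent state."""
    return all(
        original_eec(a, b, c, d) == simplified_eec(b, c)
        for a, b, c, d in product([False, True], repeat=4)
        if consistent(a, b, c, d)
    )


if __name__ == "__main__":
    print(forms_agree())
```

Under these assumptions the two forms agree in every reachable state, so
dropping the redundant terms does not change when MT6 becomes executable.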

In the MTG, nodes represent MTs. A small circle inside a node represents a
conditional branch. Solid edges represent data dependencies. Dotted edges
represent extended control dependencies. An extended control dependency means
an ordinary control dependency together with the condition under which a data
dependence predecessor of MTi is not executed.

Solid and dotted arcs connecting solid and dotted edges have two different
meanings. A solid arc indicates that the edges connected by the arc are in an
AND relationship. A dotted arc indicates that the edges connected by the arc
are in an OR relationship.

In the MTG, although the arrows of edges are omitted on the assumption that
edges point downward, an edge with an arrow represents an original control
flow edge, or branch direction, in the MFG.

3.4 Scheduling of MTs to Thread Groups 

In coarse grain task parallel processing, static scheduling and dynamic
scheduling are used to assign MTs to thread groups.

In dynamic scheduling, MTs are assigned to thread groups at runtime to cope
with runtime uncertainties such as conditional branches. The dynamic
scheduling routine is generated by the compiler and embedded into the user
program to eliminate the overhead of OS calls for thread scheduling. Although
dynamic scheduling overhead is generally large, in the OSCAR compiler it is
relatively small because dynamic scheduling is applied to coarse grain tasks
with relatively large processing times.

In static scheduling, the assignment of MTs to thread groups is determined at
compile time if the MTG has only data dependency edges. Static scheduling is
useful since it allows us to minimize data transfer and synchronization
overhead without run-time scheduling overhead.

In the proposed coarse grain task parallel processing, either scheduling
scheme can be selected for each hierarchy.



4 Code Generation in OpenMP Backend 

This section describes the code generation scheme for coarse grain task
parallel processing using threads in the OpenMP backend of the OSCAR
multigrain automatic parallelizing compiler.

The code generation scheme differs for each scheduling scheme. Therefore,
after the thread generation method is explained, the code generation scheme
for each scheduling scheme is described.

4.1 Generation of Threads 

In the proposed coarse grain task parallel processing using OpenMP, as many
threads as there are processors are generated by a PARALLEL SECTIONS directive
only once at the beginning of program execution.

Generally, to realize nested or hierarchical parallel processing, nested
threads are forked by an upper level thread. However, in the proposed scheme,
it is assumed that the number of generated threads, the thread grouping and
the scheduling scheme applied to each hierarchy are determined at compile
time. In other words, the proposed scheme realizes this hierarchical parallel
processing with single level thread generation by writing all MT code, or
embedding hierarchical scheduling routines, in each OpenMP SECTION between
PARALLEL SECTIONS and END PARALLEL SECTIONS.

This scheme allows us to minimize thread fork and join overhead and to
implement hierarchical coarse grain parallel processing without special
extensions of OpenMP.



4.2 Static Scheduling 

If a Macro Task Graph in a target layer has only data dependencies, static
scheduling is applied to reduce data transfer, synchronization and scheduling
overheads.

In static scheduling, the assignment of MTs to thread groups is determined at
compile time. Therefore, each OpenMP SECTION needs only the MTs assigned to
it, which are executed in the predetermined order.

At runtime, each thread group must synchronize with, and transfer shared data
to, other thread groups in the same hierarchy to satisfy the data dependencies
among MTs. Therefore, the compiler generates synchronization code using shared
memory.

A code image for eight threads generated by the OpenMP backend of the OSCAR
compiler is shown in Fig. 3. In this example, static scheduling is applied to
the first layer. In Fig. 3, eight threads are generated by the OpenMP PARALLEL
SECTIONS directive. The eight threads are grouped into two thread groups, each
of which has four threads. MT1 and MT3 are statically assigned to thread
group0, and MT2 is assigned to thread group1.

When static scheduling is applied, the compiler generates different code for
each thread group, which includes only the task code assigned to that thread
group.

The MTs assigned to a thread group are processed in parallel by the threads
inside the thread group, using static scheduling or dynamic scheduling
hierarchically.
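
The combination of a fixed assignment, threads created once, and
synchronization through shared memory can be modeled in a few lines. The
following Python sketch is a model under stated assumptions and not the
generated OpenMP code: each Python thread stands in for one thread group, the
flags are `threading.Event` objects, and the dependence of MT3 on both MT1 and
MT2 in the usage example is hypothetical, chosen only to show a flag wait.

```python
import threading


def run_static_schedule(schedule, deps):
    """Run a fixed MT-to-group assignment and return the completion order.

    schedule: {group name: [MT names in predetermined order]} (assumed input)
    deps:     {MT name: [MT names it data-depends on]}        (assumed input)
    """
    done = {mt: threading.Event() for mts in schedule.values() for mt in mts}
    order = []
    order_lock = threading.Lock()

    def section(mts):
        # Body of one SECTION: only the MTs assigned to this group.
        for mt in mts:
            for pred in deps.get(mt, []):
                done[pred].wait()       # synchronization flag in shared memory
            with order_lock:
                order.append(mt)        # stands in for executing the MT body
            done[mt].set()              # signal completion to other groups

    threads = [threading.Thread(target=section, args=(mts,))
               for mts in schedule.values()]
    for thread in threads:              # threads are created only once
        thread.start()
    for thread in threads:              # and joined only once
        thread.join()
    return order


if __name__ == "__main__":
    print(run_static_schedule(
        {"group0": ["MT1", "MT3"], "group1": ["MT2"]},
        {"MT3": ["MT1", "MT2"]},
    ))
```

In this model MT3 always finishes last, because group0 waits on the flag set
by group1 before it starts MT3; no scheduler runs at execution time.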



4.3 Dynamic Scheduling 

Dynamic scheduling is applied to a Macro Task Graph with runtime uncertainty
caused by conditional branches. In dynamic scheduling, since each thread group
may execute any MT, all MT code is copied to every OpenMP SECTION. Each thread
group executes MTs selectively according to the scheduling result.

For dynamic scheduling, the OpenMP backend can generate centralized scheduler
code or distributed scheduler code to be embedded into the user code for any
parallel processing layer, or nesting level. In Fig. 3, MT2, assigned to
thread group1, is processed by four threads in parallel using the centralized
scheduler. In the centralized scheduler method, a master thread, assigned to
one of the threads, assigns macro-tasks to the other three slave threads.

Fig. 3. Code image (four threads)

[Figure summary: eight OpenMP SECTIONs, one per thread, are divided into
group0 and group1. MT1 is a parallelizable loop split into the partial loops
MT1_1 to MT1_4 (RB), MT2 is an SB, and MT3 is an RB processed by group0_0 and
group0_1. First layer (MT1, MT2, MT3): static. Second layer: MT2_1, 2_2, ...:
centralized dynamic; MT3_1, 3_2, ...: distributed dynamic. Third layer
(MT3_1a, 3_1b, ...): static.]

The master thread repeats the following steps.



Behavior of Master thread

step1 Search for executable, or ready, MTs whose Earliest Executable
Condition (EEC) has been satisfied by the completion or a branch of a
preceding MT, and enqueue the ready MTs to the ready queue.
step2 Choose the MT with the highest priority and assign it to an idle slave
thread.
step3 Go back to step1.

The behavior of the slave threads is summarized in the following.

Behavior of Slave thread

step1 Wait for a macro-task assignment by the master thread.
step2 Execute the assigned macro-task.
step3 Send signals to report a branch direction and/or completion of the
task execution to the master thread.
step4 Go back to step1.
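
The master and slave steps above can be sketched as a small deterministic
simulation. This is an illustration, not the scheduler code emitted by the
OpenMP backend: the task table format, the numeric priorities and the function
name `simulate_centralized` are assumptions, and only data dependencies are
modeled, so an MT is ready when all of its predecessors have completed.

```python
import heapq


def simulate_centralized(tasks, num_slaves):
    """Simulate one master assigning MTs to slave threads.

    tasks: {name: (cost, priority, [predecessor names])} (assumed format).
    Returns {name: (slave index, start time, finish time)}.
    """
    finished = set()
    assigned = {}
    running = []                        # heap of (finish time, name, slave)
    idle = list(range(num_slaves))
    now = 0
    while len(finished) < len(tasks):
        # Master step1: find MTs whose predecessors have all completed.
        ready = [name for name, (_, _, preds) in tasks.items()
                 if name not in assigned and all(p in finished for p in preds)]
        # Master step2: highest priority first, one MT per idle slave.
        ready.sort(key=lambda name: (-tasks[name][1], name))
        for name in ready:
            if not idle:
                break
            slave = idle.pop(0)
            finish = now + tasks[name][0]
            assigned[name] = (slave, now, finish)
            heapq.heappush(running, (finish, name, slave))
        if not running:
            raise ValueError("dependencies can never be satisfied")
        # Slave step3: the next completion signal reaches the master.
        now, name, slave = heapq.heappop(running)
        finished.add(name)
        idle.append(slave)
        idle.sort()
    return assigned


if __name__ == "__main__":
    example = {
        "MT2_1": (3, 2, []),
        "MT2_2": (2, 1, []),
        "MT2_3": (1, 3, ["MT2_1", "MT2_2"]),
    }
    print(simulate_centralized(example, num_slaves=3))
```

In the usage example MT2_3 has the highest priority but still starts at time
3, after both of its predecessors have reported completion.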

Also, the compiler generates a special MT called EndMT (EMT) in all OpenMP
SECTIONs in each hierarchy. The assignment of EndMT indicates the end of its
hierarchy. In other words, if an EndMT is scheduled to thread groups, the
groups finish execution of the hierarchy. As shown in the second layer in
Fig. 3, the EndMT is written at the end of the layer.

In Fig. 3, it is assumed that MT2 is executed by a master thread (thread 4)
and three slave threads.

Next, MT3 shows an example of distributed dynamic scheduling. In this case,
MT3 is decomposed into sub-macro-tasks, which are assigned to thread groups
group0_0 and group0_1 defined inside thread group0. In this example, thread
groups group0_0 and group0_1 each have two threads. Each thread group works as
a scheduler that behaves in the same way as the master thread described above,
although distributed dynamic schedulers need mutual exclusion to access shared
scheduling data such as the EEC and the ready queue.

The distributed dynamic scheduling routines are embedded before each
macro-task code as shown in Fig. 3. Furthermore, Fig. 3 shows that MT3_1,
MT3_2 and so on are processed by the two threads inside thread group group0_0
or group0_1.
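
The need for mutual exclusion in the distributed scheme can be illustrated
with a short model in which every thread group runs the same scheduling loop.
The sketch below is an assumption-laden illustration, not the embedded
scheduling routine itself: one Python thread stands in for each thread group,
a `threading.Condition` protects the shared scheduling data, the task graph is
assumed to be acyclic, and only data dependencies are modeled.

```python
import threading


def run_distributed(tasks, num_groups):
    """Let every group schedule for itself; return {MT name: group index}.

    tasks: {name: [predecessor names]} (assumed format, acyclic).
    """
    cond = threading.Condition()        # guards the shared scheduling data
    claimed, finished, ran_on = set(), set(), {}

    def scheduler(group):
        while True:
            with cond:
                while True:
                    if len(claimed) == len(tasks):
                        return          # nothing left: plays the role of EndMT
                    ready = sorted(
                        name for name, preds in tasks.items()
                        if name not in claimed
                        and all(p in finished for p in preds))
                    if ready:
                        break
                    cond.wait()         # sleep until some MT completes
                mt = ready[0]
                claimed.add(mt)         # claim under the lock: runs only once
            ran_on[mt] = group          # stands in for executing the MT body
            with cond:
                finished.add(mt)
                cond.notify_all()       # wake groups waiting for ready MTs

    threads = [threading.Thread(target=scheduler, args=(g,))
               for g in range(num_groups)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return ran_on


if __name__ == "__main__":
    graph = {"MT3_1": [], "MT3_2": [], "MT3_3": ["MT3_1", "MT3_2"]}
    print(sorted(run_distributed(graph, num_groups=2)))
```

Which group runs which MT varies from run to run, but because claiming happens
under the lock, each MT is executed exactly once.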



5 Performance Evaluation 

This section describes the performance of coarse grain task parallelization by 
OSCAR Fortran Compiler for several programs in Perfect benchmarks and SPEC 
95fp benchmarks on IBM RS6000 SP 604e High Node 8 processor SMP. 

5.1 OSCAR Fortran Compiler 

Fig. 4 shows an overview of the OSCAR Fortran Compiler. It consists of a Front
End (FE), a Middle Path (MP) and Back Ends (BE). The OSCAR Fortran Compiler
has various Back Ends for different target multiprocessor systems, such as the
OSCAR distributed/shared memory multiprocessor system [18], Fujitsu's VPP
supercomputer, UltraSparc, PowerPC, MPI-2 and OpenMP. The newly developed
OpenMP Backend used in this paper generates parallelized Fortran source code
with OpenMP directives. In other words, the OSCAR Fortran Compiler is used as
a preprocessor that transforms an ordinary sequential Fortran program into an
OpenMP Fortran program for SMP machines.



5.2 Evaluated Programs 

The programs used for performance evaluation are ARC2D from the Perfect
Benchmarks and SWIM, TOMCATV, HYDRO2D and MGRID from the SPEC 95fp Benchmarks.
ARC2D is an implicit finite difference code for analyzing fluid flow problems
and solves the Euler equations. SWIM solves the system of shallow water
equations using finite difference approximations. TOMCATV is a vectorized mesh
generation program. HYDRO2D is a vectorizable Fortran program with double
precision floating-point arithmetic. MGRID is a multi-grid solver for a 3D
potential field.

Fig. 4. Overview of OSCAR Fortran Compiler



5.3 Architecture of IBM RS6000 SP 

The RS6000 SP 604e High Node used for the evaluation is an SMP server with
eight 200 MHz PowerPC 604e processors. Each processor has 32 KB L1 instruction
and data caches and a 1 MB L2 unified cache. The shared main memory is 1 GB.

5.4 Performance on RS6000 SP 604e High Node 

In this evaluation, a coarse grain parallelized program automatically
generated by the OSCAR compiler is compiled by the IBM XL Fortran compiler
version 5.1 [14] and executed on 1 through 8 processors of the RS6000 SP 604e
High Node. The performance of the OSCAR compiler is compared with that of the
IBM XL automatic parallelizing Fortran compiler. For compilation by XL
Fortran, the maximum optimization option "-qsmp=auto -O3 -qmaxmem=-1 -qhot" is
used.

Fig. 5(a) shows the speed-up ratio for ARC2D obtained by the proposed coarse
grain task parallelization scheme of the OSCAR compiler and by the automatic
loop parallelization of the XL Fortran compiler. The sequential processing
time for ARC2D was 77.5s, and the parallel processing time by the XL Fortran
version 5.1 compiler using 8 PEs was 60.1s. On the other hand, the execution
time of coarse grain parallel processing using 8 PEs by the OSCAR Fortran
compiler with the XL Fortran compiler was 23.3s. In other words, the OSCAR
compiler gave us 3.3 times speed-up against the sequential processing time and
2.6 times speed-up against the XL Fortran compiler for 8 processors.

Fig. 5. Speed-up of several benchmarks on RS6000

Next, Fig. 5(b) shows the speed-up ratio for SWIM. The sequential execution
time of SWIM was 551s. While automatic loop parallel processing using 8 PEs by
XL Fortran needed 112.7s, coarse grain task parallel processing by the OSCAR
Fortran compiler required only 61.1s and gave us 9.0 times speed-up through
the effective use of distributed caches.

Fig. 5(c) shows the speed-up ratio for TOMCATV. The sequential execution time
of TOMCATV was 691s. The parallel processing time using 8 PEs by XL Fortran
was 484s, a 1.4 times speed-up. On the other hand, coarse grain parallel
processing using 8 PEs by the OSCAR Fortran compiler took 154s and gave us 4.5
times speed-up against the sequential execution time. The OSCAR Fortran
compiler also gave us 3.1 times speed-up compared with the XL Fortran compiler
using 8 PEs.

Fig. 5(d) shows the speed-up for HYDRO2D. The sequential execution time of
HYDRO2D was 1036s. While XL Fortran gave us 4.7 times speed-up (221s) using
8 PEs compared with the sequential execution time, the OSCAR Fortran compiler
gave us 8.1 times speed-up (128s).

Finally, Fig. 5(e) shows the speed-up ratio for MGRID. The sequential
execution time of MGRID was 658s. For this application, the XL Fortran
compiler attained 4.2 times speed-up, or a processing time of 157s, using 8
PEs, while the OSCAR compiler achieved 6.8 times speed-up, or 97.4s.

On 8 processors, the OSCAR Fortran Compiler gives us scalable speed-up and is
1.6 to 3.1 times faster than the XL Fortran compiler for the evaluated
benchmark programs.
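
The speed-up figures in this section follow from the execution times by simple
division. The Python sketch below recomputes them from the times quoted in the
text (8 PEs for the parallel runs); the dictionary layout and function names
are choices made for this example only.

```python
# Execution times in seconds, copied from the text above:
# (sequential, XL Fortran on 8 PEs, OSCAR on 8 PEs)
TIMES = {
    "ARC2D": (77.5, 60.1, 23.3),
    "SWIM": (551.0, 112.7, 61.1),
    "TOMCATV": (691.0, 484.0, 154.0),
    "HYDRO2D": (1036.0, 221.0, 128.0),
    "MGRID": (658.0, 157.0, 97.4),
}


def speedups(sequential, xl, oscar):
    """Speed-up is the ratio of the slower time to the faster time."""
    return {
        "xl_vs_seq": sequential / xl,
        "oscar_vs_seq": sequential / oscar,
        "oscar_vs_xl": xl / oscar,
    }


def table():
    """Return all ratios rounded to one decimal place."""
    return {
        name: {key: round(value, 1) for key, value in speedups(*times).items()}
        for name, times in TIMES.items()
    }


if __name__ == "__main__":
    for name, row in table().items():
        print(name, row)
```

For ARC2D this gives 77.5 / 23.3 = 3.3 against sequential execution and
60.1 / 23.3 = 2.6 against XL Fortran. Across the five programs the ratio of
the XL Fortran time to the OSCAR time ranges from about 1.6 (MGRID) to about
3.1 (TOMCATV).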



6 Conclusions 

This paper has described the performance of coarse grain task parallel
processing using the OpenMP backend of the OSCAR multigrain parallelizing
compiler. The OSCAR compiler generates a parallelized Fortran program from a
sequential Fortran program using the OpenMP backend. Although the OSCAR
compiler can exploit hierarchical multigrain parallelism, such as the coarse
grain task level, the loop iteration level and the statement level near fine
grain task level, only two kinds of parallelism, namely coarse grain task
level and loop iteration level parallelism, are examined in this paper,
considering the machine performance parameters of the eight-processor SMP
machine used, the IBM RS6000 604e High Node.

The evaluation shows that the OSCAR compiler gives us up to 3.1 times speedup
compared with the IBM XL Fortran compiler version 5.1 for several benchmark
programs, namely Perfect Benchmarks ARC2D and SPEC 95fp TOMCATV, SWIM, HYDRO2D
and MGRID.

The authors are planning to evaluate the performance of coarse grain parallel
processing on various shared memory multiprocessor systems, including SGI
2100, Sun Enterprise 3000 and so on, using the developed OpenMP backend.



Abstract. This paper shows several optimization techniques in OpenMP
and investigates their impact using the MGCG method. MGCG is important not
only as an efficient solver but also for benchmarking, since it includes
several essential operations for high-performance computing. We evaluate
several optimizing techniques on an SGI Origin 2000 using the SGI MIPSpro
compiler and the RWCP Omni OpenMP compiler. In the case of the RWCP Omni
OpenMP compiler, the optimization greatly improves performance, whereas for
the SGI MIPSpro compiler it has little effect, although the optimized version
scales well up to 16 processors with a larger problem. This impact is examined
by a light-weight profiling tool bundled with the Omni compiler. We propose
several new directives for further performance and portability of OpenMP.



1 Introduction 



OpenMP is a model for parallel programming that is portable across shared
memory architectures from different vendors [4,5]. Shared memory architectures
have become a popular platform and are commonly used in nodes of clusters for
high-performance computing. OpenMP is accepted as a portable specification
across shared memory architectures.

OpenMP can parallelize a sequential program incrementally. To parallelize a do
loop in a serial program, it is enough to insert a parallel region construct
as an OpenMP directive just before the loop. Sequential execution is always
possible unless a compiler interprets OpenMP directives.

This paper investigates how to optimize OpenMP programs and the impact of the
optimization. Each directive of OpenMP may cause an overhead. The overhead of
each OpenMP directive can be measured by, for example, the EPCC OpenMP
microbenchmarks [2]; however, the important thing is how these overheads
affect the performance of applications. In this paper, we focus on the impact
of OpenMP source-level optimizations that reduce the overhead of creating and
joining a parallel region and eliminate unnecessary barriers, using the MGCG
method.

M. Valero et al. (Eds.): ISHPC 2000, LNCS 1940, pp. 471-481, 2000.



2 MGCG Method

The MGCG method is a conjugate gradient (CG) method with a multigrid (MG)
preconditioner and is quite efficient for the Poisson equation with severe
coefficient jumps [3,7,8]. This method is important not only for several
applications but also for benchmarking, because MGCG includes several
important operations for high-performance computing.

MGCG consists of MG and CG. MG exploits several sizes of meshes, from a
fine-grain mesh to a coarse-grain mesh, so it is necessary to execute parallel
loops with various loop lengths efficiently. CG needs an inner product besides
a matrix-vector multiply and a daxpy, so it is also necessary to execute a
reduction in parallel efficiently.

The major difficulties in the parallelization of MGCG are the matrix-vector
multiply, the smoothing method, restriction and prolongation. In this paper,
we use only regular rectangular meshes and a standard interpolation. This
choice not only avoids complexity but is also widely applicable to
applications such as computational fluid dynamics, plasma simulation and so
on. Note that MG with standard interpolation diverges when a coefficient has
severe jumps, while MGCG converges efficiently.

3 Optimization Techniques in OpenMP 

Parallelization using OpenMP is basically incremental and straightforward. All 
we should do is to determine the time-consuming and parallelizable do (or for) 
loops and to annotate them by an OpenMP directive for parallel execution. For 
example, the following code is parallelized using the PARALLEL DO directive and 
REDUCTION clause. 

      IP = 0.0
!$OMP PARALLEL DO REDUCTION(+:IP)
      DO I = 1, N
         IP = IP + A(I) * B(I)
      END DO

This easy parallelization may cover a large number of programs, including
dusty deck codes, and achieve moderate performance, although the performance
strongly depends on the program, compiler and platform. Generally, the
parallelized loops must account for most of the execution time, and each must
have enough loop length or plenty of computation to hide the overhead of
OpenMP parallelization, such as thread management and synchronization.
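
The effect of the REDUCTION clause in the inner product loop can be emulated
sequentially: each thread accumulates into a private copy that starts at zero,
and the private copies are combined when the loop ends. The Python sketch
below is an emulation for illustration only. The block-wise split of the
iterations is an assumption of this example, since the actual distribution
depends on the schedule chosen by the OpenMP implementation.

```python
def emulated_reduction(a, b, num_threads):
    """Emulate REDUCTION(+:IP) for IP = sum of A(I) * B(I).

    Returns (combined result, list of per-thread partial sums).
    """
    n = len(a)
    partial_sums = []
    for thread in range(num_threads):
        # Assumed block distribution of the iteration space.
        low = thread * n // num_threads
        high = (thread + 1) * n // num_threads
        private_ip = 0                  # private copy, initialized to zero
        for i in range(low, high):
            private_ip += a[i] * b[i]
        partial_sums.append(private_ip)
    # Combination step performed at the end of the construct.
    return sum(partial_sums), partial_sums


if __name__ == "__main__":
    print(emulated_reduction([1, 2, 3, 4], [5, 6, 7, 8], num_threads=2))
```

The combination step is one source of the overhead mentioned above: for a
short loop, combining the private copies and synchronizing can cost more than
the multiply-add work itself.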

To get much higher performance in OpenMP, it is necessary to reduce these
parallelization overheads. The following optimizations serve this purpose.

1. Join several parallel regions. Each parallel region construct starts
parallel execution, which may incur the overhead of thread creation and
join. At the extreme, it is desirable for an entire program to be one big
parallel region.

2. Eliminate unnecessary barriers. Each work-sharing construct implies a
barrier (and a flush) at the end of the construct unless NOWAIT is
specified. Note that a BARRIER directive is still necessary wherever each
thread must be guaranteed to access updated data.

3. Privatize variables if possible by specifying PRIVATE, FIRSTPRIVATE and
LASTPRIVATE.

4. Determine the trade-off between parallel execution, including
parallelization overhead, and sequential execution. Parallel execution of a
loop without enough loop length may be slower than serial execution, since
parallel execution incurs the overhead of thread management and
synchronization. For several independent loops with short loop lengths, each
loop may be parallelized using the SECTIONS construct.



Fig. 1 is an excerpt of the MGCG method written in OpenMP. In this case, a
parallel region includes three work-sharing constructs and one subroutine
call. The subroutine MG, called in the parallel region, includes several
work-sharing constructs as orphaned directives. Orphaned directives are
interpreted as work-sharing constructs when the subroutine is called within a
parallel region.

The second work-sharing construct computes an inner product, and it includes a
REDUCTION clause. The shared variable RR2 is summed up at the end of this
work-sharing construct. In most cases, this kind of computation is placed in a
subroutine. In this case, the subroutine includes an orphaned directive as
follows:



      SUBROUTINE INNER_PRODUCT(N, R1, R, RR2)
!$OMP DO REDUCTION(+:RR2)
      DO I = 1, N
         RR2 = RR2 + R1(I) * R(I)
      END DO
      RETURN
      END



Since it is necessary to perform a reduction within the subroutine, RR2 should 
not be privatized in the parallel region. 

The MGCG method is parallelized in two ways. One is a version that is paral- 
lelized naively. Basically, almost all DO loops are parallelized using a PARALLEL DO 
directive. We call this a naive version. The other is an optimized version that 
exploits the above optimization techniques. In the optimized version, the entire 
MGCG program is a parallel region. In Fig. 1, the first and the last statements 
are in a serial region. The optimized version uses a MASTER directive followed by 
a BARRIER directive to execute serially within the parallel region. This version 
also eliminates all unnecessary barriers by specifying NOWAIT. 







Osamu Tatebe et al. 



      RR2 = 0
!$OMP PARALLEL PRIVATE(BETA)
*     /***** MG Preconditioning *****/
!$OMP DO
      DO I = 1, N
         R1(I) = 0.0
      END DO
      CALL MG(...)
*     /***** beta = (new_r1, new_r) / (r1, r) *****/
!$OMP DO REDUCTION(+:RR2)
      DO I = 1, N
         RR2 = RR2 + R1(I) * R(I)
      END DO
      BETA = RR2 / RR1
*     /***** p = r1 + beta p *****/
!$OMP DO
      DO I = 1, N
         P(I) = R1(I) + BETA * P(I)
      END DO
!$OMP END PARALLEL
      RR1 = RR2



Fig. 1. An excerpt of the MGCG method in OpenMP 



4 Evaluation of Optimization Impact 

4.1 Overhead of Creating a Parallel Region 

To investigate the overhead of a parallel region construct and a work-sharing
construct, we consider two kinds of programs: one has several parallel regions
(Fig. 2), and the other has one large parallel region with several
work-sharing constructs (Fig. 3). For small N and large M, the program of
Fig. 2 shows the overhead of creating and joining a parallel region, including
the overhead of a work-sharing construct, and that of Fig. 3 shows the
overhead of a work-sharing construct alone. We evaluate these programs on an
SGI Origin 2000 with 16 R10000 195MHz processors with a developing version of
the RWCP Omni OpenMP compiler [1,6] and the SGI MIPSpro compiler version 7.3.
The current version 1.1 of the Omni compiler only supports pthreads for the
IRIX operating system; however, this developing version exploits sproc for
more efficient execution on IRIX. The sproc system call creates a new process
whose virtual address space can be shared with the parent process.

      DO J = 1, M
!$OMP PARALLEL DO
         DO I = 1, N
            A(I) = A(I) + 2 * B(I)
         END DO
      END DO

Fig. 2. A naive program with several parallel regions

!$OMP PARALLEL
      DO J = 1, M
!$OMP DO
         DO I = 1, N
            A(I) = A(I) + 2 * B(I)
         END DO
      END DO
!$OMP END PARALLEL

Fig. 3. An optimized program with one parallel region and several work-sharing
constructs

Table 1. Overhead of parallel region construct and work-sharing construct when
N = 8 and M = 100000 using eight threads [sec.]

            naive    optimized
MIPSpro      2.90       2.07
Omni        46.76       2.86

Fig. 4. Snapshot of tlogview for the naive version

Fig. 5. Snapshot of tlogview for the optimized version

Table 1 shows the overhead of parallel and work-sharing constructs when N = 8
and M = 100000 using eight threads. The overhead of a parallel region
construct is quite small with the MIPSpro compiler, as also reported by the
EPCC OpenMP microbenchmarks, while it is quite large with the Omni OpenMP
compiler. The Omni compiler basically translates an OpenMP program to a C
program that uses the Omni runtime libraries. To reduce the overhead of the
parallel region construct, it is necessary to improve the performance of the
runtime library for the Origin 2000.
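
The totals in Table 1 can be turned into a rough cost per outer iteration. The
Python sketch below is only an estimate under a stated assumption: the loop
body and the work-sharing construct are taken to cost the same in the naive
and optimized programs, so the difference between the two totals is attributed
entirely to creating and joining the parallel region. The names used are
chosen for this example.

```python
M = 100_000  # number of outer iterations, from the Table 1 caption

# Execution times in seconds, copied from Table 1.
TABLE1 = {
    "MIPSpro": {"naive": 2.90, "optimized": 2.07},
    "Omni": {"naive": 46.76, "optimized": 2.86},
}


def per_iteration_us(total_seconds, iterations=M):
    """Convert a total time in seconds to microseconds per outer iteration."""
    return total_seconds / iterations * 1e6


def region_overhead_us(compiler):
    """Estimated cost of one parallel region create/join, in microseconds."""
    row = TABLE1[compiler]
    return per_iteration_us(row["naive"] - row["optimized"])


if __name__ == "__main__":
    for compiler in TABLE1:
        print(compiler, round(region_overhead_us(compiler), 1))
```

Under this assumption the estimate is about 8.3 microseconds per parallel
region for MIPSpro and about 439 microseconds for the developing version of
Omni, which is consistent with the statement that the Omni runtime library
dominates the naive version's execution time.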



4.2 Performance Evaluation of MGCG Method 

Profiling The RWCP Omni OpenMP compiler provides a very light-weight profiling
tool and a viewer to investigate the behavior of parallel execution. All
events related to OpenMP directives are time-stamped at runtime. Each event
consists of only two double words, and the profiled log can be viewed by
tlogview.

Naive and optimized versions of MGCG are profiled on an SGI Origin 2000 with
the developing version of the Omni compiler. Figures 4 and 5 show profiled
data of the MGCG program by tlogview. The white region denotes parallel
execution, the red (dark gray) region shows a barrier, the green (light gray)
region shows loop initialization, and the black region shows a serial part.
The MGCG method roughly consists of two parts: CG and MG. The CG part includes
a daxpy operation, an inner product and a matrix-vector multiply on the finest
grid. Since each loop length is long, the CG part is white. The MG part
includes a smoothing method (Red-Black Gauss-Seidel), restriction and
prolongation at each grid level. Since MG needs operations on coarse grids
that do not have long loop lengths, the MG part is almost entirely green and
red.

Figures 6 and 7 are snapshots at a finer scale. The upper picture of Fig. 6 
shows the CG part and the lower picture shows the MG part. These two pictures 
show almost one iteration of MGCG. At this scale, the serial part stands out. 
In particular, the overhead of the MG part is conspicuous, since the MG part has 
loops of various lengths, from long to short. On the other 
hand, the optimized version successfully reduces these overheads and the serial part. 



Performance Evaluation We evaluate naive and optimized MGCG programs 
using the Omni OpenMP compiler and the SGI MIPSpro compiler. We also 
compare these OpenMP programs with a program written in MPI. The MPI 
library is the SGI Message Passing Toolkit version 1.3. Every program is compiled 
with the -Ofast optimization option. 

Each program solves a two-dimensional Poisson equation with severe coefficient 
jumps. The two-dimensional unit domain is discretized by 512 x 512 meshes 
and 1024 x 1024 meshes. This kind of problem with severe coefficient jumps is 
difficult for MG alone, while it is solved quite efficiently by a combination of 
MG and CG. 

Fig. 8 shows floating-point performance for the problem size 512 x 512. Using 
the Omni compiler, the floating-point performance of the optimized version is 
clearly better than that of the naive version, and it is almost double for large 
numbers of processors. On the other hand, with the SGI MIPSpro compiler the 
difference between the two programs is quite small and both achieve good 
performance, since the overhead of a parallel region construct of 
the MIPSpro compiler is small. With both compilers, the performance does 
not scale beyond 8 processors because the problem size is not large enough. 
The elapsed execution time of the optimized version is only 1.4 seconds with 8 
processors. 

The MPI program achieves quite good performance with 8 processors. In fact, 
each processor achieves 44 MFLOPS with 8 processors, but only 32 MFLOPS 
with 1 processor. The MGCG program parallelized by MPI needs approximately 
32 MB of data when the problem size is 512 x 512. With 8 processors, each 
processor processes only 4 MB of data, which fits into the secondary cache. 
Unfortunately, the OpenMP programs cannot exploit this advantage to the same extent. 
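
The cache argument is simple arithmetic, and a short Python sketch makes it explicit. The data sizes (32 MB and 128 MB) are taken from the text; the 4 MB secondary-cache capacity is an assumption, chosen only because the text states that 4 MB per processor fits, and the function names are hypothetical.

```python
def per_processor_mb(total_mb, processors):
    """Data owned by each processor when the problem is split evenly."""
    return total_mb / processors


def fits_in_cache(total_mb, processors, cache_mb):
    """True if one processor's share is no larger than the cache."""
    return per_processor_mb(total_mb, processors) <= cache_mb


# Assumption for illustration: the text only says that 4 MB fits.
ASSUMED_CACHE_MB = 4.0

if __name__ == "__main__":
    for total_mb in (32, 128):  # 512 x 512 and 1024 x 1024 problem sizes
        for p in (1, 2, 4, 8):
            share = per_processor_mb(total_mb, p)
            ok = fits_in_cache(total_mb, p, ASSUMED_CACHE_MB)
            print(f"{total_mb:4d} MB on {p} processors: "
                  f"{share:6.1f} MB each, fits in cache: {ok}")
```

Under this assumption only the 512 x 512 problem on 8 processors has a per-processor share that fits, which matches the case where the text reports the per-processor MFLOPS rising.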

Fig. 9 shows the result for 1024 x 1024. In this case, approximately 128 MB of 
data is necessary. Up to 8 processors, all programs but the naive version with the 
Omni compiler achieve almost the same good performance. With more processors, the 
naive version with the MIPSpro compiler and the optimized version with the 
Omni compiler degrade slightly in performance, while the MPI program and 
the optimized version with MIPSpro scale well. That is because the share of 
computation on coarse grids, i.e., loops with short loop length that tend to 
be dominated by parallelization overhead, increases as the number of processes 
increases. This difference mainly comes from the overhead of a parallel region 
construct or a work-sharing construct, as evaluated in the previous subsection. 

Fig. 6. Magnified snapshot of the naive version 

Fig. 7. Magnified snapshot of the optimized version 
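
Why short loops suffer can be illustrated with a very simple cost model. The Python sketch below is not taken from the paper: it assumes that a parallel loop costs the serial work divided by the number of processors plus a fixed per-construct overhead, and the time per iteration and the overhead value are hypothetical numbers chosen only for illustration.

```python
def parallel_efficiency(loop_length, processors, time_per_iteration, overhead):
    """Efficiency of one parallel loop under a fixed-overhead model.

    Assumed model: T_par = T_serial / P + overhead,
    efficiency = T_serial / (P * T_par).
    """
    serial = loop_length * time_per_iteration
    parallel = serial / processors + overhead
    return serial / (processors * parallel)


if __name__ == "__main__":
    T_ITER = 1e-8     # hypothetical seconds per loop iteration
    OVERHEAD = 1e-5   # hypothetical seconds per parallel construct
    for grid in (1024, 256, 64, 8):       # fine to coarse grid levels
        n = grid * grid                   # loop length on this level
        for p in (8, 16):
            eff = parallel_efficiency(n, p, T_ITER, OVERHEAD)
            print(f"grid {grid:4d} x {grid:<4d} P={p:2d} efficiency={eff:.3f}")
```

In this model the efficiency is T_serial / (T_serial + P * overhead), so it drops both when the loop gets shorter and when more processors are used, which is the trend described for the coarse-grid loops.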

Comparing the performance of the OpenMP programs for the two problem 
sizes, the problem size 512 x 512 performs better than 1024 x 1024 in the case of 
eight threads. This shows that a program parallelized by OpenMP can exploit memory 
locality slightly; however, it is no match for a program written in MPI. 



5 Proposals for OpenMP Specification 

When constant variables are privatized in a parallel region, both FIRSTPRIVATE 
and LASTPRIVATE need to be specified, since these constant variables become 
undefined at the end of the parallel region without LASTPRIVATE; however, this 
may also cause overhead, since copying is necessary from a private variable 
to a shared variable at the end of the parallel region. This case is quite common. 
We propose a new data scope attribute clause, READONLY. This clause only copies 
from the shared variable to a private variable and guarantees that the shared 
variable keeps its value after the parallel region. 

We employed an iterative method for performance evaluation. The iterative 
method is quite sensitive to rounding error and to the order of floating-point 
operations in a reduction. Since the OpenMP specification allows any order of 
floating-point operations for reduction, an OpenMP program may compute a different 
inner product on successive executions even with static loop scheduling, the 
same loop length and the same number of threads. In fact, even the number of 
iterations until convergence differs from run to run using the MIPSpro compiler. This 
may also happen with MPI, since the MPI specification also allows any order of 
floating-point operations for reduction. The problem is that the difference with 
OpenMP is observed to be much larger than that with MPI. The situation is 
much worse using dynamic loop scheduling. Since reproducible results are quite 
important for debugging and testing, this should be controlled by a new directive 
or a new environment variable. 
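
The sensitivity to reduction order follows from the non-associativity of floating-point addition. The Python sketch below is only an illustration of that effect, not a model of any OpenMP runtime: `chunked_sum` is a hypothetical helper that mimics a static schedule by summing contiguous blocks and then combining the partial sums, and the input values are chosen to make the rounding visible.

```python
import math


def sequential_sum(values):
    """Plain left-to-right accumulation in double precision."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def chunked_sum(values, threads):
    """Sum contiguous blocks (one per simulated thread), then combine.

    This imitates a static loop schedule followed by a reduction of the
    per-thread partial sums in thread order.
    """
    n = len(values)
    chunk = -(-n // threads)  # ceiling division
    partials = [sequential_sum(values[i:i + chunk]) for i in range(0, n, chunk)]
    return sequential_sum(partials)


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16, 1.0]
    print("sequential      :", sequential_sum(data))
    print("2 partial sums  :", chunked_sum(data, 2))
    print("correctly rounded:", math.fsum(data))
```

The same four numbers give different sums depending only on how the additions are grouped. In an iterative solver such differences in an inner product can change the convergence history, as the text reports.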



6 Conclusions and Future Research 

This paper showed several techniques to optimize OpenMP programs and their 
impact on the MGCG method. The impact is quite dependent on the compiler. With 
the Omni compiler, the optimized version achieved almost double the performance 
of the naive version. With the SGI MIPSpro compiler, there was little difference in 
performance with a small number of processors; however, the optimized version 
scales well with more processors. 

Even though the SGI Origin 2000 supports distributed shared memory, the 
program written in MPI performs better than the OpenMP program, especially 
with 8 processors and the problem size 512 x 512. That is because the MPI 
program exploits locality of memory explicitly. How to exploit locality is also 
a major topic of future research for the OpenMP compiler and the OpenMP specification. 

Fig. 8. Floating-point performance of MGCG (512 x 512) 




Fig. 9. Floating-point performance of MGCG (1024 x 1024) 
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MPI Using a Large-Scale Application Suite* 
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Abstract. In this paper we provide quantitative information about the 
performance differences between the OpenMP and the MPI version of 
a large-scale application benchmark suite, SPECseis. We have gathered 
extensive performance data using hardware counters on a 4-processor 
Sun Enterprise system. For the presentation of this information we use 
a Speedup Component Model, which is able to precisely show the im- 
pact of various overheads on the program speedup. We have found that 
overall, the performance figures of both program versions match closely. 
However, our analysis also shows interesting differences in individual 
program phases and in overhead categories incurred. Our work gives initial 
answers to a largely unanswered research question: what are the 
sources of inefficiencies of OpenMP programs relative to other programming 
paradigms on large, realistic applications? Our results indicate that 
the OpenMP and MPI models are basically performance-equivalent on 
shared-memory architectures. However, we also found interesting differ- 
ences in behavioral details, such as the number of instructions executed, 
and the incurred memory latencies and processor stalls. 



1 Introduction 

1.1 Motivation 

Programs that exhibit significant amounts of data parallelism can be written us- 
ing explicit message-passing commands or shared-memory directives. The mes- 
sage passing interface (MPI) is already a well-established standard. OpenMP 
directives have emerged as a new standard for expressing shared-memory pro- 
grams. When we choose one of these two methodologies, the following questions 
arise: 

- Which is the preferable programming model, shared-memory or message-passing 
programming on shared-memory multiprocessor systems? Can we 
replace message-passing programs with OpenMP without significant loss of 
speedup? 

- Can both message-passing and shared-memory directives be used simultaneously? 
Can exploiting two levels of parallelism on a cluster of SMPs provide 
the best performance with large-scale applications? 

* This work was supported in part by NSF grants #9703180-CCR and #9872516-EIA. 

To answer such questions we must be able to understand the sources of overheads 
incurred by real applications programmed using the two models. 

In this paper, we deal with the first question using a large-scale application 
suite. We use a specific code suite representative of industrial seismic processing 
applications. The code is part of the SPEC High-Performance Group’s (HPG) 
benchmark suite, SPEChpc96 [1]. The benchmark is referred to as SPECseis, or 
Seis for short. Parallelism in Seis is expressed at the outer-most level, i.e., at 
the level of the main program. This is the case in both the OpenMP and the 
MPI version. As a result, we can directly compare runtime performance statistics 
between the two versions of Seis. 

We used a four-processor shared-memory computer for our experiments. We 
have used the machine’s hardware counters to collect detailed statistics. To dis- 
cuss this information we use the Speedup Component Model, recently introduced 
in [2] for shared memory programs. We have extended this model to account for 
communication overhead which occurs in message passing programs. 



1.2 Related Work 

Early experiments with a message passing and a shared-memory version of Seis 
were reported in [3]. Although the shared-memory version did not use OpenMP, 
this work described the equivalence of the two programming models for this 
application and machine class. The performance of two CFD applications was 
analyzed in [4]. Several efforts have converted benchmarks to OpenMP form. 
An example is the study of the NAS benchmarks [5,6], which also compared 
the MPI and OpenMP performances with that of SGI’s automatic parallelizing 
compiler. 

Our work complements these projects in that it provides performance data 
from the viewpoint of a large-scale application. In addition, we present a new 
model for analyzing the sources of inefficiencies of parallel programs. Our model 
allows us to identify specific overhead factors and their impact on the program’s 
speedup in a quantitative manner. 



2 Characteristics of SPECseis96 

Seis includes 20,000 lines of Fortran and C code, and includes about 230 Fortran 
subroutines and 120 C routines. The computational parts are written in For- 
tran. The C routines perform file I/O, data partitioning, and message passing 
operations. We use the 100 MB data set, corresponding to the small data set in 
SPEC’s terminology. 

The program processes a series of seismic signals that are emitted by a single 
source which moves along a 2-D array on the earth’s surface. The signals are 
reflected off the earth’s interior structures and are received by an array of 
receptors. The signals take the form of a set of seismic traces, which are processed 
by applying a sequence of data transformations. Table 1 gives an overview of 
these data transformation steps. The seismic transformation steps are combined 
into four separate seismic applications, referred to as four phases. They include 
Phase 1: Data Generation, Phase 2: Stacking of Data, Phase 3: Frequency Domain 
Migration, and Phase 4: Finite-Difference Depth Migration. The seismic 
application is described in more detail in [7]. 



Table 1. Seismic Process. A brief description of each seismic process which makes 
up the four processing phases of Seis. Each phase performs all of its processing on every 
seismic data trace in its input file and stores the transformed traces in an output file. 
We removed the seismic process called RATE, which performs benchmark measurements 
in the official SPEC benchmark version of Seis 

Process  Description 
-------  ----------- 
Phase 1: Data Generation 
VSBF     Read velocity function and provide access routines. 
GEOM     Specify source/receiver coordinates. 
DGEN     Generate seismic data. 
FANE     Apply 2-D spatial filters to data via Fourier transforms. 
DCON     Apply predictive deconvolution. 
NMOC     Apply normal move-out corrections. 
PFWR     Parallel write to output files. 
VRFY     Compute average amplitude profile as a checksum. 

Phase 2: Stacking of Data 
PFRD     Parallel read of input files. 
DMOC     Apply residual move-out corrections. 
STAR     Sum input traces into zero offset section. 
PFWR     Parallel write to output files. 
VRFY     Compute average amplitude profile as a checksum. 

Phase 3: Fourier Domain Migration 
PFRD     Parallel read of input files. 
M3FK     3-D Fourier domain migration. 
PFWR     Parallel write to output files. 
VRFY     Compute average amplitude profile as a checksum. 

Phase 4: Finite-Difference Depth Migration 
VSBF     Data generation. 
PFRD     Parallel read of input files. 
MG3D     A 3-D, one-pass, finite-difference migration. 
PFWR     Parallel write to output files. 
VRFY     Compute average amplitude profile as a checksum. 
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The four phases transfer data through file I/O. In the current implementa- 
tion, previous phases need to run to completion before the next phase can start, 
except for Phases 3 and 4, which both migrate the stacked data, and therefore 
only depend on data generated in Phase 2. The execution times of the four 
phases on one processor of the Sun Ultra Enterprise 4000 system are: 



Phase    Application      Time 
-------  ---------------  ------ 
Phase 1  Data Generation  272s 
Phase 2  Data Stacking    62.2s 
Phase 3  Time Migration   7.1s 
Phase 4  Depth Migration  1,201s 
Total                     1,542s 
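
To see how strongly the total is dominated by a single phase, the times can be converted into shares of the serial run. The Python sketch below uses only the four phase times listed above; the function name is hypothetical. Note that the four values add up to 1,542.3 s, so the listed total of 1,542 s appears to be rounded.

```python
# Serial execution times in seconds, as listed in the text.
PHASE_SECONDS = {
    "Phase 1 (Data Generation)": 272.0,
    "Phase 2 (Data Stacking)": 62.2,
    "Phase 3 (Time Migration)": 7.1,
    "Phase 4 (Depth Migration)": 1201.0,
}


def phase_shares(times):
    """Return each phase's fraction of the summed execution time."""
    total = sum(times.values())
    return {name: seconds / total for name, seconds in times.items()}


if __name__ == "__main__":
    for name, share in phase_shares(PHASE_SECONDS).items():
        print(f"{name:28s} {share * 100:5.1f}%")
```

Phase 4 accounts for well over three quarters of the serial time and Phase 3 for less than one percent, which is why differences observed in Phase 3 have little effect on the overall result.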



More significant is the heterogeneous structure of the four phases. Phase 1 
is highly parallel with synchronization required only at the start and finish. 
Phases 2 and 4 communicate frequently throughout their execution. Phase 3 
executes only three communications, independent of the size of the input data 
set, and is relatively short. 

Figure 1 shows the number of instructions executed in each application phase 
and the breakdown into several categories using the SPIX tool [8]. The data was 
gathered from a serial run of Seis. One fourth of the instructions executed in 
Phase 4 are loads, contributing the main part of the memory system overhead, 
which will be described in Figure 4. Note that a smaller percentage of the in- 
structions executed in Phase 3 are floating-point operations, which perform the 
core computational tasks of the application. Phase 3 exhibits startup overhead 
simply because it executes so quickly with very few computation steps. 




Fig. 1. The Ratio of Dynamic Instructions at Run-Time, Categorized by Type. 
Instructions executed for the four seismic phases from a serial run were recorded 



Figure 2 shows our overall speedup measurements of the MPI and OpenMP versions 
with respect to the serial execution time. The parallel code variants execute 
nearly the same on one processor as the original serial code, indicating that 
negligible overhead is induced by adding parallelism. On four processors, the MPI 
code variant exhibits better speedups than the OpenMP variant. We will describe 
the reasons in Section 4. 




[Figure 2: panels (a) 1 Processor and (b) 4 Processors, each with bars for Phase 1, Phase 2, Phase 3, Phase 4, and Total] 

Fig. 2. Speedups of the MPI and OpenMP Versions of Seis. Graph (a) shows the 
performance of each seismic phase as well as the total performance on one processor. 
Graph (b) shows the speedups on four processors. Speedups are with respect to the 
one-processor runs, measured on a Sun Enterprise 4000 system. Graph (a) shows that 
the parallel code variants run at high efficiency. In fact, parallelizing the code improves 
the one-processor execution of Phase 1 and Phase 4. Graph (b) shows that nearly ideal 
speedup is obtained, except with Phase 3 



3 Experiment Methodology 

3.1 Speedup Component Model 

To quantify and summarize the effects that the different compiling and program- 
ming schemes have on the code’s performance, we will use the speedup component 
model, introduced in [2]. This model categorizes overhead factors into several 
main components: memory stalls, processor stalls, code overhead, thread man- 
agement, and communication overhead. Table 2 lists the categories and their 
contributing factors. These model components are measured through hardware 
counters (TICK register) and timers on the Sun Enterprise 4000 system [9]. 

The speedup component model represents the overhead categories so that 
they fully account for the performance gap between measured and ideal speedup. 
For the specific model formulas we refer the reader to [2]. We have introduced the 
communication overhead category specifically for the present work to consider 
the type of communication used in Seis. The parallel processes exchange data 
at regular intervals in the form of all-to-all broadcasts. We define the communi- 
cation overhead as the time that elapses from before the entire data exchange 
(of all processors with all processors) until it completes. Both the MPI and the 
OpenMP versions perform this data exchange in a similar manner. However, the 
MPI version uses send/receive operations, whereas the OpenMP version uses 
explicit copy operations, as illustrated in Figure 3. 

Table 2. Overhead Categories of the Speedup Component Model 

Overhead Category       Contributing Factors  Description                                                     Measured with 
----------------------  --------------------  --------------------------------------------------------------  ------------- 
Memory stalls           IC miss               Stall due to I-Cache miss.                                      HW Cntr 
                        Write stall           The store buffer cannot hold additional stores.                 HW Cntr 
                        Read stall            An instruction in the execute stage depends on an earlier       HW Cntr 
                                              load that is not yet completed. 
                        RAW load stall        A read needs to wait for a previously issued write to the       HW Cntr 
                                              same address. 
Processor stalls        Mispred. stall        Stall caused by branch misprediction and recovery.              HW Cntr 
                        Float Dep. stall      An instruction needs to wait for the result of a floating       HW Cntr 
                                              point operation. 
Code overhead           Parallelization       Added code necessary for generating parallel code.              computed 
                        Code generation       More conservative compiler optimizations for parallel code.     computed 
Thread management       Fork&join             Latencies due to creating and terminating parallel sections.    timers 
                        Load imbalance        Wait time at join points due to uneven workload distribution. 
Communication overhead  Load imbalance        Wait time at communication points.                              timers 
                        Copy operations       Data movement between processors. 
                        Synchronization       Overhead of synch. operations. 

The MPI code uses blocking sends and receives, requiring processors to wait 
for the send to complete before the receive in order to swap data with another 
processor. The OpenMP code can take advantage of the shared-memory space 
and have all processors copy their processed data into the shared-space, perform 
a barrier, and then copy from the shared-space. 
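
The shared-memory exchange described above (copy into a shared space, barrier, copy out) can be sketched with Python threads. This is only an illustration of the pattern, not the Seis code: the names `all_to_all_exchange`, `work` and `global_buffer` are hypothetical, Python threads stand in for OpenMP threads, and for simplicity each thread also exchanges a block with itself.

```python
import threading


def all_to_all_exchange(work):
    """Exchange blocks so that thread i ends up with work[j][i] from every j.

    work[i][j] is the block that thread i has prepared for thread j.
    """
    n = len(work)
    global_buffer = [[None] * n for _ in range(n)]   # shared space
    result = [[None] * n for _ in range(n)]
    barrier = threading.Barrier(n)

    def worker(me):
        # Step 1: copy the processed data into the shared space.
        for p in range(n):
            global_buffer[me][p] = work[me][p]
        # Step 2: wait until every thread has finished writing.
        barrier.wait()
        # Step 3: copy the blocks addressed to this thread back out.
        for p in range(n):
            result[me][p] = global_buffer[p][me]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    blocks = [[f"from {i} to {j}" for j in range(4)] for i in range(4)]
    for row in all_to_all_exchange(blocks):
        print(row)
```

The barrier is what makes the second copy safe: without it a thread could read a slot of the shared space before its owner has written it. In the message-passing variant the same ordering is enforced by the blocking send and receive calls instead.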

3.2 Measurement Environment 

We used a Sun Ultra Enterprise 4000 system with six 248 MHz UltraSPARC 
Version 9 processors, each with a 16 KB L1 data cache and a 1 MB unified L2 cache, 
using a bus-based protocol. To compile the MPI and serial versions of the code 
we use the Sun Workshop 5.0 compilers. The message-passing library we used 
is the MPICH 1.2 implementation of MPI, configured for a Sun shared-memory 
machine. The shared-memory version of Seis was compiled using the KAP/Pro 
compilers (guidef77 and guidec) on top of the Sun Workshop 5.0 compilers. 
The flags used to compile the three different versions of Seis were -fast -O5 
-xtarget=ultra2 -xcache=16/32/1:1024/64/1. 

MPI implementation: 

    FOR p=1 TO <all other processors> 
        send(p,Work(...p...)) 
    FOR p=1 TO <all other processors> 
        receive(p,Work(...p...)) 

OpenMP implementation: 

    FOR p=1 TO <all other processors> 
        COPY Work(...p...) -> GlobalBuffer(...p...) 
    BARRIER 
    FOR p=1 TO <all other processors> 
        COPY GlobalBuffer(...p...) -> Work(...p...) 

Fig. 3. Communication Scheme in Seis and its Implementation in MPI and 
OpenMP 

We used the Sun Performance Monitor library package and made 14 
runs of the application, gathering hardware counts of memory stalls, instruction 
counts, etc. Using these measurements we could describe the overheads seen in 
the performance of the serial code and the difference between observed and ideal 
speedup for the parallel implementations of Seis. The standard deviation across 
these runs was negligible, except in one case mentioned in our analysis. 

4 Performance Comparison between OpenMP and MPI 

In this section we first inspect the overheads of the 1-processor executions of the 
serial as well as the parallel program variants. Next, we present the performance 
of the parallel program executions and discuss the change in overhead factors. 



4.1 Overheads of the Single-Processor Executions 

Figure 4 shows the breakdown of the total execution time into the measured 
overheads. “OTHER” captures all processor cycles not spent in measured stalls. 
This category includes all productive compute cycles such as instruction and data 
cache hits, and instruction decoding and execution without stalls. It also includes 
stalls due to I/O operations; however, we have found these to be negligible. For 
all four phases, the figure compares the execution overheads of the original serial 
code with those of the parallel code running on only one processor. The difference 
in overheads between the serial and single-processor parallel executions indicates 
performance degradations due to the conversion of the original code to parallel 
form. Indeed, in all but the Fourier Migration code (Phase 3) the parallel codes 
incur more floating-point dependence stalls than the serial code. This change is 
unexpected because the parallel versions use the same code generator that the 
serial version uses, except that they link with the MPI libraries or transform 
the OpenMP directives in the main program to subroutines with thread calls, 
respectively. 



[Figure 4 legend: COMMUNICATION, OTHER, STALL_FPDEP, STALL_MISPRED, LOAD_STALL_RAW, STALL_LOAD, STALL_STOREBUF, STALL_IC_MISS; group labels: Code Overhead, Processor Stalls, Memory Stalls] 

Fig. 4. Overheads for One-Processor Runs. The graphs show the overheads found 
in the four phases of Seis for the serial run and the parallel runs on one processor. 
The parallel versions of the code cause more of the latencies to be within the FP units 
than the serial code does. Also, notice that the loads in the Finite-Difference Migration 
(Phase 4) cause fewer stalls in the parallel versions than in the serial code. In general, 
the latencies accrued by the two parallel versions exhibit very similar characteristics 



Also from Figure 4, we can see that in Phases 1, 2, and 4 compiling with 
the parallel environment reduces the “OTHER” category. This means that the 
instructions, excluding all stalls, execute faster in the 1-processor run of the parallel 
code than in the serial code. This can be the result of higher quality code (more 
optimizations applied, resulting in fewer instructions) or of an increased degree of 
instruction-level parallelism. Again this is unexpected, because the same code 
generator is used. 

In Phase 4, the parallel code versions reduce the amount of load stalls for 
both the one-processor and four-processor runs. The parallel codes change data 
access patterns because of the implemented communication scheme. We assume 
that this leads to slightly increased data locality. 

The OpenMP and MPI programs executed on one processor perform simi- 
larly, except for Phase 3. In Phase 3, the OpenMP version has a higher “OTHER” 
category, indicating less efficient code generation of the parallel variant. How- 
ever, Phase 3 is relatively short and we have measured up to a 5% performance 
variance in repeated executions. Hence, the shown difference is not significant. 

4.2 Analysis of the Parallel Program Performance 

To discuss how the overheads change when the codes are executed in parallel we 
use the Speedup Component Model, introduced in Section 3.1. The results are 
given in Figure 5 for MPI and OpenMP on one and four processors in terms of 
speedup with respect to the serial run. The upper bars (labeled “P=1”) present 
the same information that is displayed in Figure 4. However, the categories are 
now transformed so that their contributions to the speedup become clear. In the 
upper graphs, the ideal speedup is 1. The effect that each category has on the 
speedup is indicated by the components of the bars. A positive effect, indicated 
by the bar components on top of the measured speedup, stands for a latency that 
increases the execution time. The height of the bar quantifies the “lack of ideal 
speedup” due to this component. A negative component represents an overhead 
that decreases from the serial to the parallel version. Negative components can 
lead to superlinear speedup behavior. The sum of all components always equals 
the number of processors. For a one-processor run, the sum of all categories 
equals one. 
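
The bookkeeping rule stated above can be checked mechanically. The Python sketch below assumes one reading of the text, namely that the measured speedup plus all overhead components adds up to the number of processors; the exact model formulas are in [2] and are not reproduced here. All numbers and function names are hypothetical and serve only to show the accounting, including a negative component.

```python
def unexplained_gap(measured_speedup, components, processors):
    """Part of the ideal speedup not covered by speedup plus components."""
    return processors - (measured_speedup + sum(components.values()))


def is_fully_accounted(measured_speedup, components, processors, tol=1e-9):
    """True if speedup and overhead components add up to the processor count."""
    return abs(unexplained_gap(measured_speedup, components, processors)) <= tol


if __name__ == "__main__":
    # Hypothetical four-processor run; values are not measurements.
    components = {
        "memory stalls": -0.1,        # negative: overhead shrank vs. serial
        "processor stalls": 0.2,
        "code overhead": 0.3,
        "thread management": 0.1,
        "communication": 0.3,
    }
    speedup = 3.2
    print("gap:", unexplained_gap(speedup, components, 4))
    print("fully accounted:", is_fully_accounted(speedup, components, 4))
```

A negative entry, such as the memory stall component in this made-up example, reduces the sum of the overheads and is how the model can represent superlinear behavior.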

The lower graphs show the four-processor performance. The overheads in 
Phase 1 remain similar to those of the one-processor run, which translates into 
good parallel efficiency on our four-processor system. This is expected of Phase 1, 
because it performs highly parallel operations but only communicates to fork 
processes at the beginning and join them at the end of the phase. Phase 2 of 
the OpenMP version shows a smaller improvement due to the code generation 
overhead component, which explains why less speedup was measured than with 
the MPI version. Again, this difference is despite the use of the same code gener- 
ating compiler and it shows up consistently in repeated measurements. Phase 3 
behaves quite differently in the two program variants. However this difference is 
not significant, as mentioned earlier. 

Figure 5 shows several differences between the OpenMP and the MPI implementation 
of Seis. In Phase 4 we can see that the number of memory system stalls is 
lower in the OpenMP version than in the MPI version. This shows up in the form 
of a negative memory system overhead component in the OpenMP version. 
Interestingly, the MPI version has the same measured speedup, as it has a larger 
negative code generation overhead component. Furthermore, the processor system 
stalls decrease in the 4-processor execution; however, this gain is offset by 
an increase in communication overheads. These overheads are consistent with 
the fact that Phase 4 performs the most communication out of all the phases. 

Fig. 5. Speedup Component Model for the Versions of Seis. The upper graph 
displays the speedups with respect to the serial version of the code and executed on only 
one processor. The lower graph shows the speedups obtained when executing on four 
processors. An overhead component represents the amount that the measured speedup 
would increase (decrease for negative components) if this overhead were eliminated and 
all other components remained unchanged 

Overall, the parallel performance of the OpenMP and the MPI versions of 
Seis is very similar. In the most time-consuming code, Phase 4, the performance 
is the same. The second-most significant code, Phase 2, shows better performance 
with MPI than with OpenMP. However, our analysis indicates that the reason 
can be found in the compiler’s code generation and not in the programming 
model. The communication overheads of both models are very small in Phases 1, 
2, and 3. Only Phase 4 has a significant communication component and it is 
identical for the MPI and OpenMP variants of the application. 

5 Conclusions 

We have compared the performance of an OpenMP and an MPI version of a 
large-scale seismic processing application suite. We have analyzed the behav- 
ior in detail using hardware counters, which we have presented in the form of 
the speedup component model. This model quantifies the impact of the various 
overheads on the programs’ speedups. 

We have found that the overall performance of the MPI and OpenMP variants 
of the application is very similar. The two application variants exploit the same 
level of parallelism, which is expressed equally well in both programming models. 
Specifically, we have found that no performance difference is attributable to 
differences in the way the two models exchange data between processors. 

However, there are also interesting differences in individual code sections. 
We found that the OpenMP version incurs more code overhead (e.g., the code 
executes more instructions) than the MPI version, which becomes more 
pronounced as the number of processors is increased. We also found situations where 
the OpenMP version incurred fewer memory stalls. However, we do not attribute 
these differences to intrinsic properties of any particular programming model. 

While our studies basically show equivalence of the OpenMP and MPI pro- 
gramming models, the differences in overheads of individual code sections may 
point to potential improvements of compiler and architecture techniques. Inves- 
tigating this potential is the objective of our ongoing work. 
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Abstract Direct numerical simulation(DNS) of fundamental fluid 
flow using the 3-dimensional Navier-Stokes equations is a 
typical large-scale computation which requires a high-performance 
computer with vector and parallel processing. In the present paper, a 
simulation of a turbulent boundary layer flow with a strong adverse pressure 
gradient on a flat plate was carried out on the NAL Numerical Wind Tunnel (NWT). 
A boundary layer subjected to a strong adverse pressure gradient creates a 
separation bubble followed by a region with small, but positive, skin 
friction. This flow case contains features that have proven to be difficult 
to predict with existing turbulence models. The data from the present 
simulation are used to investigate the scalings near the wall, a 
crucial concept with respect to turbulence models. The present analysis 
uses spectral methods, and the parallelization was done using 
MPI (Message-Passing Interface). Good efficiency was obtained on 
the NWT. For comparison with the performance of other machines, previous 
computations on the T3E and SP2 are also shown. 



1 Introduction 

The near wall scaling of the mean velocities is very important for the correct 
behavior of wall damping functions used when turbulence models are used in the 
Reynolds averaged Navier-Stokes equations (RANS). For a zero pressure 
gradient (ZPG) boundary layer, the damping functions and boundary conditions in the 
logarithmic layer are based on a theory where the friction velocity, $u_\tau$, is used as a 
velocity scale. However, in the case of a boundary layer under an adverse pressure 
gradient (APG), $u_\tau$ is not the correct velocity scale, especially for a strong APG and 
moderate Reynolds number. In the case of separation this is clear since $u_\tau$ becomes 
zero. The combination of a pressure gradient and moderate Reynolds number gives a 
flow that deviates from the classical near wall laws. The near wall behavior of a 
turbulent boundary layer close to separation is studied using direct numerical 
simulations (DNS) on massively parallel computers. The results are analyzed and can 
be used to improve the near wall behavior in turbulence models for flows with 
separation. This will be extremely important because the existing turbulence models 
which are used in CFD codes for aerodynamic design and analysis fail to predict 
separation or even a flow field under strong adverse pressure. 

From the computational point of view, DNS of the Navier-Stokes equations is a typical 
large-scale computation well suited to vector and parallel processing. In the past, DNS has 
been performed for simple flows such as homogeneous isotropic turbulence, which takes an 
enormous amount of computing time [1]. DNS of a simple but realistic flow such as 
2D flow on a flat plate was difficult because such a flow is spatially developing and 
spectral methods cannot be applied directly. In DNS, spectral methods have been used 
because of their high accuracy in time and space. In most cases, finite difference 
methods in physical space are not applied because of their low accuracy. Spectral methods 
rely on periodicity, which is a merit for periodic flows but a drawback when a 
spatially developing flow is the target. Such a flow is not periodic, at least in the x or flow 
direction, and spectral methods cannot be applied in that direction as they stand. To overcome this 
difficulty, an idea called the "fringe region technique" is incorporated in the present 
analysis, so that a periodicity condition is realized in the x-direction as well as in the other 
directions. The paper thus treats a flow field with an adverse pressure gradient strong 
enough to create reversed flow close to the flat plate wall, i.e. separation, and provides 
information on the turbulence characteristics of such a flow field, where the existing 
turbulence models have difficulty in prediction. 

In the present analysis, spectral methods were applied and the parallelization 
was done using MPI (Message-Passing Interface). Good efficiency was obtained on 
NWT. The original code was developed at KTH and FFA using FORTRAN 77 [2]. 
We did not try to convert it to NWT-FORTRAN because of the limited time available for 
joint work between NAL and KTH. The original 1D FFT subroutine was run on 
NWT. A 3D FFT subroutine is available for NWT-FORTRAN and should give 
faster performance; we may try it in future work. For comparison with the performance 
of other machines, previous computations on T3E and SP2 at KTH are also shown. 



2 Numerical Method and Parallelization 

The numerical approximation consists of spectral methods with Fourier discretization 
in the horizontal directions and Chebyshev discretization in the normal direction. 
Since the boundary layer is developing in the downstream direction, it is necessary to 
use non-periodic boundary conditions in the streamwise direction. This is possible 
while retaining the Fourier discretization if a fringe region is added downstream of the 
physical domain. In the fringe region the flow is forced from the outflow of the 
physical domain to the inflow. In this way the physical domain and the fringe region 
together satisfy periodic boundary conditions. 

Figure 1 Fringe region technique 

Time integration is performed using a third-order Runge-Kutta method for the 
advective and forcing terms and Crank-Nicolson for the viscous terms. A 2/3 
dealiasing rule is used in the streamwise and spanwise directions. The numerical code is 
written in FORTRAN and consists of two major parts (figure 2): a linear part 
(linear), where the actual equations are solved in spectral space, and a non-linear 
part (nonlin), where the non-linear terms in the equations are computed in physical 
space. All spatial derivatives are calculated in the spectral formulation. The main 
computational effort in these two parts is in the FFT. 
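
The 2/3 dealiasing rule can be illustrated in one dimension. The following is a minimal sketch in Python with NumPy, not the FORTRAN code used for the simulation; the function names, the grid size and the test signal are assumptions chosen for illustration. The product of two retained modes creates a wave number that the grid cannot represent, and the rule discards the range of modes onto which it is folded back.

```python
import numpy as np

def dealias_two_thirds(u_hat):
    """Zero all Fourier modes with |k| >= n/3 (hypothetical helper)."""
    n = u_hat.shape[0]
    k = np.fft.fftfreq(n, d=1.0 / n)   # integer wave numbers 0, 1, ..., -1
    return u_hat * (np.abs(k) < n / 3.0)

def nonlinear_product(u_hat, v_hat):
    """Form u*v in physical space from dealiased spectra, then dealias again."""
    u = np.fft.ifft(dealias_two_thirds(u_hat))
    v = np.fft.ifft(dealias_two_thirds(v_hat))
    return dealias_two_thirds(np.fft.fft(u * v))

n = 48                                  # modes with |k| < 16 are retained
x = 2.0 * np.pi * np.arange(n) / n
u_hat = np.fft.fft(np.cos(15 * x))

# cos(15x)^2 = 1/2 + 1/2 cos(30x); on 48 points cos(30x) is aliased to k = 18.
aliased = np.fft.fft(np.cos(15 * x) ** 2)
clean = nonlinear_product(u_hat, u_hat)
```

In this example the aliased contribution lands at wave number 18, inside the discarded upper third, so only the correct mean value of 1/2 survives in `clean`.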




Figure 2 The main structure of the program 

In the linear part one xy-plane is treated separately for each z-position. The field is 
transformed in the y direction to spectral space, a solution is obtained and then 
transformed back to physical space in the y direction. This is performed with a loop over 
all z values, where the subroutine linear is called for each z. The xy-planes are 
transferred from the main storage with the routine getxy to the memory where the 
actual computations are performed. The corresponding storing of data is performed 
with putxy. 

In the non-linear part the treatment of the data is similar to that in the linear part. 
One xz-plane is treated separately for each y-position. The field is transformed in both 
the x and z directions to physical space, where the non-linear terms are computed. 
Then the field is transformed in the x and z directions back to spectral space. This is 
performed with a loop over all y values, where the subroutine nonlin is called for 
each y. The xz-planes are transferred from the main storage with the routine getxz to 
the memory where the actual computations are performed. The corresponding storing 
of data is performed with putxz. 

Communication between processors is necessary because the two parts of the 
code work on different planes. The data set (velocity field) is divided between the processors along 
the z direction, see figure 3a. Thus, in the linear part, no communication is needed, 
since each processor holds the data for its own z-positions (xy-planes). When the non-linear 
terms are calculated, each processor needs data for a horizontal plane (xz-plane). The 
main storage is kept at its original position on the different processors. In the non-linear 
part each processor collects the two-dimensional data from the other processors, 
performs the computations on it, and then redistributes it back to the main 
storage. Figure 3b shows an example of the data gathering for one processor. 
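
The gathering step can be mimicked serially. The sketch below is an illustration in Python with NumPy and does not use MPI; the helper names, the array sizes and the way y-planes are shared out between processors are assumptions, not details taken from the code described here. As in figure 3, the x-direction is omitted.

```python
import numpy as np

def split_along_z(field, nproc):
    """Main storage: each processor holds every y for its own block of z."""
    return np.array_split(field, nproc, axis=1)

def gather_xz_planes(local_blocks, ip, nproc):
    """Collect, for processor ip, its share of y-planes over the full z range."""
    ny = local_blocks[0].shape[0]
    y_share = np.array_split(np.arange(ny), nproc)[ip]
    # One piece comes from every processor; the piece from ip itself is local.
    pieces = [block[y_share, :] for block in local_blocks]
    return np.concatenate(pieces, axis=1)

ny, nz, nproc = 8, 12, 4
field = np.arange(ny * nz, dtype=float).reshape(ny, nz)   # indexed (y, z)
blocks = split_along_z(field, nproc)
planes_for_p2 = gather_xz_planes(blocks, ip=1, nproc=nproc)  # processor number two
```

In a message-passing implementation each entry of `pieces` other than the local one would correspond to a receive from another processor, and the redistribution to the main storage would be the reverse operation.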





Figure 3 a) The distribution of the main storage on four processors ip=1,...,4. b) The gathering 
of data in the nonlinear part (nonlin) of the code for processor number two. The completely 
shaded area is local to the processor and need not be received from the others, and the 
half-shaded area is sent to processor number two. The x-direction is omitted for clarity. 



3 Numerical Parameters 

The first simulation was made on a Cray T3E at NSC in Linköping, using 32 
processors. The tuning of the pressure gradient for the desired flow situation was 
performed there. After the design of the pressure gradient, a simulation with 20 million 
modes was performed on an IBM SP2 at PDC, KTH in Stockholm, using 32 
processors; 20 million modes is the maximum size on the SP2. A further 
large-scale computation was needed to validate the 20 million mode result and possibly 
to obtain a more refined resolution of the flow field. The Numerical Wind Tunnel (NWT) at 
NAL was selected for this purpose. A 40 million mode analysis was made on NWT 
using 64 nodes. The same code was used on all three computers, using MPI 
(Message-Passing Interface) for the communication between the processors. NWT is 
a parallel vector processor consisting of 166 nodes, each rated at 1.7 GFLOPS. NWT 
uses vector processing with higher performance, while the other two machines use 
super-scalar processors. For a comparison between the three computers for the full 
simulation, see table 1. 

The simulations start with a laminar boundary layer at the inflow, which is 
triggered to transition by a random volume force near the wall. All quantities are 
non-dimensionalized by the freestream velocity ($U$) and the displacement thickness 
($\delta^*$) at the starting position of the simulation ($x=0$), where the flow is laminar. At 
that position $R_{\delta^*}=400$. The length (including the fringe), height and width of the 
computation box were 700 x 65 x 80 in these units (see table 2). 
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Table 1. The performance of the code given in Mflop/s for the 20 million mode simulation on 
T3E and SP2, and 40 million mode simulation on NWT 





                                        T3E     SP2     NWT
peak processor performance              600     640    1700
code performance per processor           30      50     320
total performance on 64 processors     1900    3200   20500



Table 2. Computational box 



Case    Lx     Ly    Lz
APG     700    65    80
SEP     700    65    80



The number of modes in this simulation was 720 x 217 x 256, which gives a total 
of 40 million modes or 90 million collocation points (see table 3). The fringe region has a length 
of 100 and the trip is located at x=10. The simulations were run for a total of 7500 
time units ($\delta^*/U$) starting from scratch, and the sampling for the turbulent statistics 
was performed during the last 1000 time units. The actual NWT CPU time was about 700 
hours using 64 nodes. 
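
The two totals quoted above can be checked with a few lines of arithmetic. The sketch below is in Python; the assumption that the collocation grid is padded by a factor 3/2 in x and z (the physical-space counterpart of a 2/3 dealiasing rule) is an inference, not something stated explicitly for this count.

```python
nx, ny, nz = 720, 217, 256

# Number of spectral modes.
modes = nx * ny * nz

# Assumption: 3/2 padding in the two Fourier directions (x and z) only.
collocation_points = (3 * nx // 2) * ny * (3 * nz // 2)

print(f"modes              = {modes:,}")
print(f"collocation points = {collocation_points:,}")
```

Under this assumption the counts are about 40.0 million and 90.0 million, which matches the rounded figures in the text.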

Table 3 Number of modes 



Case    NX     NY     NZ     N
APG     512    193    192    19x10^6
SEP     720    217    256    40x10^6



4 Performance of the Code 

In figure 4 the performance of the code on the two super-scalar computers is shown in 
Mflop/s together with the optimal speed. This is a small test case and the performance 
is lower than for a real simulation on many processors. The scaling is better on the 
T3E, while the overall performance is better on the SP2, which is approximately twice 
as fast as the T3E. The NWT is over six times as fast as the SP2, which gives a 
performance of 20 Gflop/s on 64 processors for the simulation with 40 million modes 
(table 1). The fastest FFT computation recorded on NWT was made by the code 
BIGCUBE, which reached 90.3 GFLOPS using 128 nodes. This record was presented for the 
SC94 Gordon Bell Prize [1]. The value corresponds to about 45.1 GFLOPS 
with 64 nodes, since on NWT linear scalability is attained up to the maximum number of nodes. 
The present result of 20 GFLOPS is thus somewhat less than half of that record. 
The reason is that the Gordon Bell record used a 3-D FFT and NWT FORTRAN, 
and pre-fetching and overlapping of data I/O were fully exploited. In the present 
computation, FORTRAN 77 with MPI was used with a 1-D FFT. Since 
we did not have enough time to convert the source code to NWT FORTRAN, we did 
not change the code. It is remarkable that NWT showed this performance even with code 
that was not optimized for it. 
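
The ratios discussed in this section follow directly from the entries of table 1. The short Python sketch below recomputes them; the dictionary layout and variable names are illustrative assumptions, and the BIGCUBE figure for 64 nodes assumes the linear scaling stated in the text.

```python
# Values in Mflop/s per processor, taken from table 1.
peak = {"T3E": 600.0, "SP2": 640.0, "NWT": 1700.0}
measured = {"T3E": 30.0, "SP2": 50.0, "NWT": 320.0}

fraction_of_peak = {m: measured[m] / peak[m] for m in peak}
total_on_64 = {m: 64 * measured[m] for m in peak}        # Mflop/s
nwt_over_sp2 = measured["NWT"] / measured["SP2"]

# BIGCUBE record scaled from 128 to 64 nodes, in Gflop/s (linear scaling assumed).
bigcube_on_64 = 90.3 * 64 / 128

for m in peak:
    print(f"{m}: {100 * fraction_of_peak[m]:.1f} % of peak, "
          f"{total_on_64[m] / 1000:.1f} Gflop/s on 64 processors")
```

With these numbers the code reaches roughly 5 % of peak on the T3E, 8 % on the SP2 and 19 % on NWT, and the per-processor ratio between NWT and SP2 is 6.4.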




Figure 4 Mflop/s rates for different number of processors for a small test case. — T3E - - SP2 



5 Results 

Results from smaller simulations with weaker pressure gradients have been fully 
analyzed and presented in [3]. These simulations were an important step towards the 
strong APG case presented here. The free stream velocity varies according to a power 
law in the downstream coordinate, $U \sim x^m$. In the present simulation the exponent $m$ 
is equal to -0.25. The friction velocity, $u_\tau$, is negative where separation, i.e. reversed 
flow, occurs. The boundary layer has a shear stress very close to zero at the wall over a 
large portion of the downstream direction, as seen from figure 5. The separated region 
is located between x=150 and about x=300. 

For a zero pressure gradient (ZPG) boundary layer the velocity profile in the 
viscous sub-layer is described by $u^+ = y^+$, where superscript + denotes the viscous 
scaling based on $u_\tau$. Under a strong APG this law is not valid, and from the equations 
the following expression can be derived, 

$$u^p = \frac{1}{2}\,(y^p)^2 - \left(\frac{u_\tau}{u_p}\right)^2 y^p . \qquad (1)$$

The viscous scaling based on $u_\tau$ has to be abandoned since $u_\tau$ becomes zero at 
separation. A different velocity scale, based on the pressure gradient, can be derived 
from the equations valid in the near-wall region. This velocity, $u_p$, replaces $u_\tau$ as the 
scaling parameter, and the scaled quantities are denoted by superscript p instead of +. 

The velocity profiles from DNS are compared with the profile above in figure 6. 
This figure shows velocity profiles near the wall for x=250 and x=300 in pressure 
gradient scaling. The higher profile is located at x=250. Both profiles are from within 
the separated region. The solid lines are DNS data and the dashed lines are the profiles 
given by equation 1. 
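
The near-wall expression in pressure-gradient scaling, read here as $u^p = \frac{1}{2}(y^p)^2 - (u_\tau/u_p)^2 y^p$ for the separated region, is simple enough to evaluate directly. The Python sketch below is an illustration only; the function name and the sample value of $u_\tau/u_p$ are assumptions and are not taken from the simulation data.

```python
def near_wall_profile(y_p, u_tau_over_u_p):
    """Velocity u^p at wall distance y^p in pressure-gradient scaling.

    Uses the minus sign of the separated-flow form; the ratio u_tau/u_p
    is supplied by the caller (hypothetical interface).
    """
    return 0.5 * y_p**2 - (u_tau_over_u_p**2) * y_p

ratio = 0.5                                 # illustrative value only
zero_crossing = 2.0 * ratio**2              # y^p where u^p returns to zero
inside = near_wall_profile(0.25, ratio)     # point inside the reversed layer
```

For this form the velocity is negative between the wall and $y^p = 2(u_\tau/u_p)^2$, which is the thin layer of reversed flow next to the wall, and positive beyond it.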
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Figure 5 — U - - x 10 Figure 6 — u’’ - - 

The data from DNS follow the theoretical curve in a region close to the wall. 
Equation 1 is valid only in a region where the flow is completely governed by viscous 
forces. This region is very small at low Reynolds numbers, hence the limited overlap 
in figure 6. The agreement between the DNS data and the theoretical expression in the near-wall 
region indicates that boundary conditions for turbulence models can be improved to 
give proper results even in a separated region. The near-wall behavior is crucial 
in turbulence modeling, and the new result from this simulation has the potential to 
improve the wall damping functions substantially. 
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Abstract. Macroscopic elastic properties of materials depend on the 
underlying microscopic structures. We have investigated the topological 
structure of three-dimensional network glass, such as vitreous SiO2, and 
its effect on the rigidity, using a parallel molecular-dynamics (MD) 
approach. Topological analysis based on graph theory is employed 
to characterize disordered networks in the computer-generated model of 
vitreous SiO2. The nature of the connectivity of the elementary units beyond 
the nearest neighbor, which is related to the medium-range order 
of the amorphous state, is described in terms of the ring distribution 
obtained by shortest-path analysis. In large-scale MD simulations, the task 
of detecting these rings from a large amount of data is computationally 
demanding. Elastic moduli of vitreous SiO2 are calculated with the 
fluctuation formula for internal stress. The quantitative relation between the 
ring statistics for vitreous SiO2 and the elastic moduli is discussed. 

Keywords: parallel molecular dynamics simulation, vitreous SiO2, 
network connectivity, elastic properties, fluctuation formula 
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Abstract. In the present study, turbulent heat transfer in open-channel 
flows has been numerically investigated by means of Direct 
Numerical Simulations (DNSs) with a constant temperature at both the free 
surface and the bottom wall. The DNSs were conducted for two Prandtl 
numbers, 1.0 and 5.0, with neutral (i.e., zero gravity) or stable 
stratification (Richardson number 27.6), at a Reynolds number of 
200 based on the friction velocity and flow depth. As a result, the 
coherent turbulent structures of fluid motion and thermal mixing, 
the influence of the Prandtl number on heat transfer, the buoyancy effect on 
turbulent structures and heat transfer, and the relationships among them are 
revealed and discussed. 



1 Introduction 

Free-surface turbulent flows are very often found in industrial devices such as 
nuclear fusion reactors and chemical plants, not to mention rivers and oceans. 
Therefore, investigating the turbulent structures near the free surface is very important for 
understanding the heat and mass transport phenomena across the free surface. From 
this viewpoint, some DNSs with scalar transport including a buoyancy effect have been 
carried out [1], [2]. Interesting information was obtained about the relationship between 
the turbulent motion, the so-called "surface renewal vortex" [3], and the scalar transport, and 
also about the interaction between buoyancy and turbulence. However, in 
these studies the molecular diffusivity of the scalar was comparable to that of momentum, so it 
is questionable whether these results can be used for fluid flows with higher Prandtl or Schmidt 
numbers. The accuracy of the experimental or DNS databases near the free surface 
has so far not been sufficient to construct a turbulence model at the free-surface boundary, 
especially one that considers the effects of the Prandtl number of the fluid, the buoyancy 
and the surface deformation on the flow and on heat and mass transfer. 

The aims of this study are to clarify the buoyancy effect on the turbulent structure 
under various Prandtl number conditions and to reveal the turbulent heat transfer 
mechanism in open-channel flows, via DNS. 
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2 Numerical Procedure 



2.1 Governing Equations 



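The governing equations are the incompressible Navier-Stokes equations with the Boussinesq 
approximation, the continuity equation and the energy equation: 

$$\frac{\partial u_i^*}{\partial t} + u_j^* \frac{\partial u_i^*}{\partial x_j} = -\frac{\partial}{\partial x_i}\left(\frac{p^*}{\rho}\right) + \nu \frac{\partial^2 u_i^*}{\partial x_j \partial x_j} + \beta g \Delta T \theta^* \delta_{i2} \qquad (1)$$

$$\frac{\partial u_i^*}{\partial x_i} = 0 \qquad (2)$$

$$\frac{\partial \theta^*}{\partial t} + u_j^* \frac{\partial \theta^*}{\partial x_j} = \alpha \frac{\partial^2 \theta^*}{\partial x_j \partial x_j} \qquad (3)$$

where $u_i^*$ is the $i$th component of velocity ($i$=1, 2, 3), $x_1$ ($x$) is the streamwise direction, 
$x_2$ ($y$) is the vertical direction, $x_3$ ($z$) is the spanwise direction, $t$ is time, $\beta$ is the thermal 
coefficient of volumetric expansion, the superscript * denotes an instantaneous value, 
$g$ is the gravitational acceleration, $p^*$ is the pressure, $\rho$ is the fluid density, $\nu$ is the kinematic 
viscosity, $\alpha$ is the thermal diffusivity, and $\delta_{i2}$ is the Kronecker delta, which selects the vertical 
direction. The dimensionless temperature is defined as $\theta^* = (T^* - T_{wall})/\Delta T$, where the 
temperature difference is $\Delta T = T_{surface} - T_{wall}$, $T_{surface}$ denotes the free-surface 
temperature and $T_{wall}$ denotes the wall temperature. 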



2.2 Numerical Method and Boundary Condition 

Numerical integration of the governing equations is based on a fractional step 
method [4], and time integration uses a second-order Adams-Bashforth scheme. A second-order 
central differencing scheme [5], [6] is adopted for the spatial discretization. The 
computational domain and coordinate system are shown in Fig. 1. 
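
As an illustration of the time-integration scheme named above, the following Python sketch applies the second-order Adams-Bashforth formula to a scalar test equation. It is not the DNS code; the function name, the forward-Euler start-up step and the test problem dy/dt = -y are assumptions made for this example.

```python
import math

def adams_bashforth2(f, y0, t0, dt, nsteps):
    """Integrate dy/dt = f(t, y) with the two-step Adams-Bashforth formula."""
    y, t = y0, t0
    f_prev = f(t, y)
    y = y + dt * f_prev            # start-up step: forward Euler (assumption)
    t += dt
    for _ in range(nsteps - 1):
        f_curr = f(t, y)
        # y_{n+1} = y_n + dt * (3/2 f_n - 1/2 f_{n-1})
        y = y + dt * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr
        t += dt
    return y

approx = adams_bashforth2(lambda t, y: -y, 1.0, 0.0, 0.001, 1000)
error = abs(approx - math.exp(-1.0))
```

The scheme needs only one new evaluation of the right-hand side per step, which is why it is a common choice for the explicit terms in a fractional step method.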

As boundary conditions for the fluid motion, a free-slip condition at the free surface, 
a no-slip condition at the bottom wall and cyclic conditions in the streamwise and 
spanwise directions are imposed. For the energy equation, the 
temperatures at the free surface and the bottom wall are kept constant ($T_{surface} > T_{wall}$). 



2.3 Numerical Conditions 

Numerical conditions are listed in Table 1, where $R_\tau = u_\tau h/\nu$ is the turbulent 
Reynolds number based on the friction velocity of the neutral stratification and the flow 
depth $h$, and $Ri = \beta g \Delta T h/u_\tau^2$ is the Richardson number. The computations were carried 
out for about 2000 non-dimensional time units ($t u_\tau^2/\nu$), and all statistical values were 
calculated as time and spatial averages over horizontal planes (homogeneous 
directions) after the flow had become fully developed. However, in the case of 
stable stratification (Ri=27.6) for Pr=1.0, the flow laminarized. 




504 Yoshinobu Yamamoto et al. 



The computation was therefore stopped at 1200 non-dimensional time units from the 
initial turbulent condition, a fully developed neutral stratification case for the 
passive scalar. All quantities normalized by the friction velocity of the neutral 
stratification, the kinematic viscosity and the mean heat flux at the free surface are denoted 
by the superscript +. 



3 Results and Discussion 

3.1 Coherent Turbulent Structures in Open-Channel Flow 

Figures 2-4 show visualizations of the coherent turbulent structures in the case of 
neutral stratification. Figure 2 shows an iso-surface representation of the second 
invariant of the velocity gradient tensor, $\frac{1}{2}\left(\partial u_i^*/\partial x_j \cdot \partial u_j^*/\partial x_i\right)$. The iso-surface 
regions correspond to regions containing strong vorticity. Near the bottom 
wall, streamwise vortices stretched in the streamwise direction can be seen. This 
indicates that turbulence is generated near the wall. The free surface, however, makes no 
contribution to the turbulence generation in open-channel flows at low Reynolds 
number. 

Figure 3-(b) shows a top view of fluid markers generated along a line parallel to the 
z-axis at $y^+=12.35$. Alternating high- and low-speed regions can be seen near the 
bottom wall. In this respect, the turbulence structure near the wall is the same as the wall 
turbulence of ordinary turbulent channel flows. Figure 3-(a) shows a side view of 
fluid markers generated along a line parallel to the y-axis at $z^+=270$. As in wall 
turbulence, the lift-up of low-speed streaks, the so-called "burst", is depicted. 
In open-channel flows, however, if this burst reaches the free surface, a typical 
turbulence structure affected by the free surface can appear underneath the 
free surface. Near the free surface, the effect of the velocity gradient on the turbulent 
structure is reduced by a very large horizontal vortex, as large as the flow depth scale, 
as shown in Fig. 3-(c). This horizontal vortex impinges onto the free 
surface and turns toward the wall. This motion is in good agreement with the flow 
visualization experiment [7], i.e., it may correspond to the "surface renewal vortex." It 
is also consistent with the turbulence statistics for neutral stratification [2], [3], [9]. 
The mean velocity, the turbulent statistics and the budget of the Reynolds stresses are 
published elsewhere [9]. 

Figure 4 shows an instantaneous temperature field. Cold fluid lifted up 
from near the bottom wall and warm fluid drafted down from near the free surface 
by this vortex motion are observed, so thermal mixing takes place between these motions. 
These typical fluid motions could be the main cause of the heat transfer 
enhancement in turbulent open-channel flows, regardless of whether the stratification is 
neutral or stable. 
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3.2 Statistics of Turbulence 

Mean velocity profiles are shown in Fig. 5. In the stable stratification and high Pr 
(=5.0) case, flow laminarization was clearly observed near the free surface, while there 
is no difference from the neutral case in the near-wall region. On the other hand, in the 
stable stratification and low Pr (=1.0) case, turbulence cannot be maintained anywhere in 
the flow. 

The turbulent intensity profiles are shown in Fig. 6. Near the free surface, all 
components of the turbulent intensity are suppressed by the stable stratification. 
Reynolds-stress profiles are shown in Fig. 7. There is a slightly negative value near 
the free surface in the case of stable stratification. 

Mean temperature profiles are shown in Fig. 8. The mean temperature gradient for 
the stable case (Pr=5.0, dotted line) is compared with that of the neutral case (Pr=5.0, solid 
line). This might indicate that local heat transfer in the stable case is 
promoted by the buoyancy effect. However, since the bulk mean temperature of the stable 
case (Pr=5.0) is the lowest of all cases, the total heat transfer itself seems to be 
suppressed by the stable stratification. 

Figure 9 shows the scalar flux and fluctuation profiles in the case of neutral 
stratification. It can be seen that the scalar fluctuation is produced by the mean 
velocity gradient near the wall and by the mean scalar gradient near the free surface. 
However, these profiles are distinct from each other. In particular, in the neutral case 
(Pr=5.0), the turbulent scalar statistics reach their maxima near the free surface, 
where the typical turbulence structures exist. This may suggest that, if the Prandtl 
number is higher, the heat transfer is enhanced by the turbulent structures near the free 
surface. 

Wall-normal turbulent scalar flux profiles are shown in Fig. 10. In the neutral and 
lower Pr (=1.0) case, the profile of the turbulent heat flux is almost symmetric, and in the 
neutral and higher Pr (=5.0) case, the profile leans toward the bottom wall. In the 
stable case (Pr=5.0), it leans toward the free surface because of the buoyancy effect. 

A difference in scale between the neutral and the stable cases may be related to 
the normalization, which is based on the friction velocity of the neutral stratification, etc. 
Figures 11 and 12 show the budgets of the Reynolds shear stress and the turbulent kinetic 
energy in the stable case (Pr=5.0). For the Reynolds stress, shown in Fig. 11, the 
buoyancy production (solid line) is active near the wall. For the turbulent 
kinetic energy, shown in Fig. 12, the stable stratification does not affect the 
energy budget near the wall. This explains why the momentum boundary 
layer is thinner than the thermal boundary layer, as a consequence of the 
local heat transfer mechanism described above. These findings are consistent with the results for 
the mean velocity and scalar profiles. 

Instantaneous turbulent temperature fields near the free surface in the neutral case 
are shown in Fig. 13. The scalar field is transported by the fluid motion of the so-called 
"surface renewal vortex." However, in the case of Pr=5.0, filamentous high-temperature 
fragments persist because the time scale of the fluid motion is much shorter 
than the thermal diffusion time scale. This filamentous structure might be 
closely related to the local heat transfer and the Counter-Gradient Flux (CGF) [10] 
shown in Fig. 14. This indicates that attention must be paid to whether the 
Boussinesq approximation can be assumed for high Prandtl or Schmidt number fluids. 







4 Conclusions 

In this study, Direct Numerical Simulations of two-dimensional fully developed 
turbulent open-channel flows were performed. The main results can be summarized as 
follows: 

(1) According to the flow visualization, near the free surface the heat transfer is 
enhanced by a large horizontal vortex of the flow depth scale, which is affected by the 
presence of the free surface. 

(2) If the Prandtl number is higher, turbulent structures near the free surface have a great 
impact on the scalar transport, and the Reynolds analogy between momentum and 
scalar transport cannot be applied in the region near the free surface. The reason is that 
filamentous high-temperature fragments persist because the time scale of the 
fluid motion is much shorter than the heat diffusion time scale. 

(3) Under stable stratification, if the Prandtl number is lower, the flow cannot 
maintain turbulence, and the turbulence structures near the free 
surface are affected. 

(4) Owing to the buoyancy effect, the wall-normal turbulent scalar flux in the stable case 
is locally enhanced near the wall, and its statistical profile is opposite to that of 
the neutral stratification case. Overall, the total heat transfer itself was suppressed. 
Fig. 1 Computational domain and coordinate system 



Table 1 Numerical condition 



R_τ     Grid Number (x, y, z)   Resolution (Δx+, Δy+, Δz+)   Pr    Ri
200     128, 108, 128           10, 0.5-4, 5                 1.0   -
200     128, 108, 128           10, 0.5-4, 5                 1.0   27.6
200     256, 131, 256           5, 0.5-2, 2.5                5.0   -
200     256, 131, 256           5, 0.5-2, 2.5                5.0   27.6
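
The grid numbers and resolutions in Table 1 imply the horizontal extent of the computational domain. The Python sketch below multiplies them out; it assumes uniform spacing in the two periodic directions (x and z), and the variable names are illustrative.

```python
cases = [
    {"grid": (128, 108, 128), "dx_plus": 10.0, "dz_plus": 5.0},   # Pr = 1.0
    {"grid": (256, 131, 256), "dx_plus": 5.0, "dz_plus": 2.5},    # Pr = 5.0
]

r_tau = 200.0   # turbulent Reynolds number, so the flow depth is 200 wall units

# (Lx+, Lz+) for each grid, assuming uniform spacing in x and z.
extents_plus = [(c["grid"][0] * c["dx_plus"], c["grid"][2] * c["dz_plus"])
                for c in cases]
extents_in_depths = [(lx / r_tau, lz / r_tau) for lx, lz in extents_plus]
```

Both grids give the same domain, 1280 by 640 wall units or 6.4 by 3.2 flow depths, so the higher Prandtl number cases differ only by the finer resolution.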




(a) Side view 



Fig. 2 Surfaces of second invariant velocity gradient tensor, $Q^+=0.03$ 








(b) Top view 




(c) Bird view 



Fig. 2 (continued) Surfaces of second invariant velocity gradient tensor, $Q^+=0.03$ 
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(b) Fluid markers are generated along a line to the z-axis 
Near bottom wall, $y^+=12.35$ (Top view) 




(c) Fluid markers are generated along a line to the z-axis 
Near free-surface, $y^+=194.1$ (Top view) 

Fig. 3 Visualization of coherent structures 
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Surface renewal vortex 




(a) Side view, 0.0 (Black) < $\theta^*$ < 1.0 (White), $z^+=270$ 



(b) Top view, 0.1 (Black) < $\theta^*$ < 0.9 (White), $y^+=12.35$, near bottom wall 



(c) Top view, 0.7 (Black) < $\theta^*$ < 1.0 (White), $y^+=196$, near free-surface 



Fig. 4 Instantaneous scalar fields, Pr=1.0 
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Fig. 5 Mean velocity profiles 




Fig. 6 Turbulent intensity profiles 

Fig. 7 Shear stress profiles 

Fig. 8 Mean scalar profiles 

Fig. 9 Scalar flux and fluctuation profiles (Neutral stratification) 

Fig. 10 Wall-normal scalar profiles 
Fig. 11 Budget of Reynolds shear stress (Pr=5.0) 

Fig. 12 Budget of turbulent energy (Pr=5.0) 




Fig. 13 Turbulent scalar fields (Top view), -0.27 (Black) < $\theta'$ < 0.27 (White) 
(a) Side view, 0.1 (Black) < $\theta^*$ < 0.9 (White) 





Z,/=1280 



(c) Top view, 0.4 (Black) < $\theta^*$ < 0.9 (White) 



Fig. 14 Instantaneous scalar fields (Neutral stratification case) 
Abstract. Turbulent transport computations for fully-developed turbulent 
pipe flow were carried out by means of a direct numerical simulation 
(DNS) procedure. To investigate the effect of Reynolds number on the 
turbulent structures, the Reynolds number based on the friction velocity and 
the pipe radius was set to $Re_\tau$ = 150, 180, 360, 500 and 1050. The maximum number of 
computational grid points, used for $Re_\tau$ = 1050, is 1024 x 512 x 768 
in the $z$-, $r$- and $\phi$-directions, respectively. The friction coefficients are 
in good agreement with the empirical correlation. Turbulent quantities 
such as the mean flow, turbulent stresses, turbulent kinetic energy 
budget and turbulent statistics were obtained. It is found that the 
turbulent structures depend on the Reynolds number. 



1 Introduction 

The turbulent transport mechanism in a pipe flow is of great importance from an 
engineering viewpoint. Up to the present, a number of experimental studies have 
been carried out under various heating conditions. For example, Hishida et al. [1] and 
Isshiki et al. [2] measured fully-developed pipe flow with constant temperature 
and constant heat flux, respectively. In the present study, the final objective is to 
elucidate the heat transfer mechanism of a pipe flow with constant heat flux by using 
Direct Numerical Simulations (DNS, hereafter). DNS has played an important role 
in the numerical study of turbulent pipe flow. Despite the importance of DNS for turbulence 
modelling and for studying turbulence phenomena, only DNS of the 
velocity field in turbulent pipe flow has been carried out [5]; the thermal 
field has not been studied. The previously developed DNS code (Satake and Kunugi [6], [8]) 
has been extended to the thermal field, and DNS of the velocity and thermal fields in 
turbulent pipe flow has been carried out. The heating boundary condition is imposed on the wall as a 
constant heat flux around the whole circumference. The spatial distributions of 
the turbulent structure and the scalar flux are presented, and the transport 
mechanism of the thermal field is studied from the scalar flux budgets. In addition, the Nusselt 
number, as a macroscopic parameter relevant to thermal-hydraulic 
design, is discussed and compared with an empirical correlation. 



2 Numerical Procedure 



The DNS code numerically solves the incompressible Navier-Stokes and 
continuity equations in cylindrical coordinates using a second-order finite 
volume discretization scheme with the radial momentum flux formulation [9]. 
These equations are integrated in time by using the fractional-step method [10] 
with the Crank-Nicolson scheme and a modified third-order Runge-Kutta scheme [11]. The 
Poisson equation for the pressure is solved with a direct FFT solver. The Poisson 
equation in Fourier space is written as 



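$$\frac{1}{r}\frac{\partial}{\partial r}\left(r\frac{\partial \delta p}{\partial r}\right) - \left(\frac{k_m^2}{r^2} + k_n^2\right)\delta p = \frac{1}{2\beta_k \Delta t}\,\nabla\cdot\hat{u}_i, \qquad m = 0,\ldots,N_\phi-1, \quad n = 0,\ldots,N_z-1, \qquad (1)$$

where the modified wave numbers satisfy $k_m^2 \propto 1-\cos(2\pi m/N_\phi)$ and $k_n^2 \propto 1-\cos(2\pi n/N_z)$. 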



As shown in Fig. 1, the domain decomposition and the transpose algorithms
were applied to Eq. (1). Watanabe et al. [7] found that this code can handle
grid systems of up to 6,300,000 points on 16 processors of the Fujitsu VPP machine.
With 16 processors, the parallel efficiency of the code was around 46.6%. In
our previous studies of turbulent pipe flow (Satake and Kunugi [6], [8]),
this DNS code was shown to be in good agreement with existing DNS results.
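The TDMA stage in Fig. 1 presumably refers to the tridiagonal matrix algorithm: once Eq. (1) has been Fourier transformed in the two periodic directions, the remaining radial operator couples only neighbouring grid points, so each (m, n) mode gives one tridiagonal system. The following Python sketch is a generic Thomas algorithm, not the authors' code; the function name `thomas_solve` and the small test system are illustrative assumptions.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (TDMA).

    lower[i] multiplies x[i-1], diag[i] multiplies x[i] and upper[i]
    multiplies x[i+1]; lower[0] and upper[-1] are ignored.
    No pivoting is done, so the matrix should be diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Hypothetical 4x4 example: -x[i-1] + 4 x[i] - x[i+1] = b[i]
solution = thomas_solve([0.0, -1.0, -1.0, -1.0],
                        [4.0, 4.0, 4.0, 4.0],
                        [-1.0, -1.0, -1.0, 0.0],
                        [3.0, 2.0, 2.0, 3.0])
```

Because the systems for different (m, n) modes are independent, they can be distributed over processors once the data have been transposed so that the radial direction is local.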



3 Computational Conditions 

The computational domain of the fully developed turbulent pipe flow is shown in
Fig. 2. The number of grid points, the Reynolds numbers and the grid resolutions are
summarized in Table 1. To perform the DNS at the highest Reynolds number in
this study, the largest grid of 1024 x 512 x 768 points is adopted, which uses 84 GB
of main memory on 64 PEs of the Fujitsu VPP700E vector-parallel computer at RIKEN.
A uniform heat flux was applied to the wall as the thermal boundary condition.
The Prandtl number of the working fluid was set to 0.71. Further details of the
velocity boundary condition for the pipe geometry can be found in Satake and
Kunugi [6].
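The memory figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes 1 GB = 10^9 bytes and 8-byte (double precision) words; neither is stated in the text, and the variable names are illustrative.

```python
# Grid and machine figures quoted in the text
GRID = (1024, 512, 768)
MEMORY_BYTES = 84e9      # assumption: 1 GB = 10**9 bytes
N_PE = 64
BYTES_PER_WORD = 8       # assumption: double precision storage

points = GRID[0] * GRID[1] * GRID[2]
bytes_per_point = MEMORY_BYTES / points
words_per_point = bytes_per_point / BYTES_PER_WORD
memory_per_pe_gb = 84 / N_PE

print(f"grid points          : {points:,}")
print(f"bytes per grid point : {bytes_per_point:.0f}")
print(f"words per grid point : {words_per_point:.0f}")
print(f"memory per PE (GB)   : {memory_per_pe_gb:.2f}")
```

Under these assumptions the largest case has about 4 x 10^8 grid points and roughly two dozen double-precision words of storage per point.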

4 Results and Discussions 

The friction coefficients and Nusselt numbers are shown in Figs. 3 and 4. All
results are in excellent agreement with the Blasius friction law and the empirical
correlation. The present Nusselt number for Re_τ = 180 is 18.74. Isshiki
et al. [2] obtained Nu = 20.06, which is in close agreement with the present
result. The values for Re_τ = 150 and 180 are also in good agreement with the
empirical correlation of Gnielinski [12]. The mean velocity profiles are compared
with previous DNS and experimental data in Fig. 5. At Re_τ
= 180 (Re_b = U_b D/ν = 5300), the present results are in excellent agreement with
the DNS data of Eggels et al. [5] and with the experimental data of Durst et
al. [3]. The other present results, except for Re_τ = 1050, also agree with the
experimental data of Durst et al. [3]. At Re_τ = 1050 (Re_b = 40000), the present DNS
is in excellent agreement with the experimental data of Laufer [4].
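As a rough cross-check of the quoted numbers, the Blasius friction law links the bulk and friction Reynolds numbers, and the Gnielinski correlation gives a reference Nusselt number. The sketch below uses the textbook forms C_f = 0.079 Re_b^(-1/4) (Fanning friction factor) and the Gnielinski formula with the Petukhov friction factor; the exact forms used in the paper are not given in the text, so these are assumptions.

```python
import math


def blasius_cf(re_b):
    """Fanning friction coefficient from the Blasius law (textbook form)."""
    return 0.079 * re_b ** -0.25


def re_tau_from_re_b(re_b):
    """Re_tau = u_tau R / nu = (Re_b / 2) * sqrt(C_f / 2), with Re_b = U_b D / nu."""
    return 0.5 * re_b * math.sqrt(blasius_cf(re_b) / 2.0)


def gnielinski_nu(re_b, pr):
    """Gnielinski correlation with the Petukhov (Darcy) friction factor."""
    f = (0.79 * math.log(re_b) - 1.64) ** -2
    return (f / 8.0) * (re_b - 1000.0) * pr / (
        1.0 + 12.7 * math.sqrt(f / 8.0) * (pr ** (2.0 / 3.0) - 1.0)
    )


for re_b in (5300, 40000):
    print(re_b, round(re_tau_from_re_b(re_b)), round(gnielinski_nu(re_b, 0.71), 1))
```

With these forms, Re_b = 5300 and 40000 map to friction Reynolds numbers close to the quoted 180 and 1050, and the correlation value at Re_b = 5300 lies within about ten percent of the DNS value of 18.74.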

The Reynolds shear stress profiles are shown in Fig. 6. All the total stress
profiles are straight lines, which indicates that the time averaging is sufficient.
Figures 7 to 9 show the velocity fluctuations of each component. The fluctuations
in the radial and circumferential directions are energetic and depend on the
Reynolds number. The mean temperature profile is shown in Fig. 10. The results
agree with the experimental data of Isshiki et al. [2] at Re_τ = 180. At Re_τ
= 1050, a logarithmic region is clearly observed. The distribution coincides
with Θ+ = 2.18 ln y+ + 3.0.

The total heat flux can be obtained as 



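$$
-\overline{u_r'^+\theta'^+} + \frac{1}{Pr}\frac{\partial\Theta^+}{\partial r^+}
= \frac{2}{r^+ R^+ U_b^+}\int_0^{r^+} U_z^+\, r^+\, dr^+ \tag{2}
$$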



The above profile is computed and the result is shown in Fig. 11. In the vicinity
of the wall, the profile exceeds unity owing to the effect of the cylindrical
coordinate. Thus, the right-hand side of Eq. (2) is multiplied by r+/R+. Figure 12
shows the root-mean-square temperature fluctuation normalized by the friction
temperature. The peak values all increase slightly and move toward the wall
with increasing Reynolds number. The temperature fluctuation spreads to
both the pipe center and the near-wall region with increasing Reynolds number.

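The correlation coefficients $R_{u_z u_r}$, $R_{u_z\theta}$ and $R_{u_r\theta}$ are defined as

$$
R_{u_z u_r} = \frac{\overline{u_z' u_r'}}{u_{z,\mathrm{rms}}\,u_{r,\mathrm{rms}}} \tag{3}
$$

$$
R_{u_z\theta} = \frac{\overline{u_z'\theta'}}{u_{z,\mathrm{rms}}\,\theta_{\mathrm{rms}}} \tag{4}
$$

$$
R_{u_r\theta} = \frac{\overline{u_r'\theta'}}{u_{r,\mathrm{rms}}\,\theta_{\mathrm{rms}}} \tag{5}
$$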



The cross-correlation coefficients are shown in Fig. 13. R_{u_z θ} is 0.96 near
the wall. In a previous pipe-flow experiment, Bremhorst and Bullock [13] obtained
R_{u_r θ} = -0.47 and R_{u_z u_r} = -0.43. These values are in good agreement with the
present data in the logarithmic region. The close agreement between R_{u_r θ} and
R_{u_z u_r} shows that the mechanism of the wall-normal turbulent heat flux is similar
to that of the Reynolds shear stress. This similarity is attributed to the molecular
Prandtl number of 0.71, which is close to unity.
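The correlation coefficients of Eqs. (3)-(5) are covariances normalized by the two root-mean-square fluctuations. A minimal Python sketch of that definition follows, assuming the averaging is a plain mean over samples (in the DNS it would be taken over time and the homogeneous directions); the sample values are made up for illustration.

```python
import math


def correlation_coefficient(a, b):
    """R_ab = mean(a' b') / (a_rms * b_rms), with a' = a - mean(a)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    fluct_a = [x - mean_a for x in a]
    fluct_b = [y - mean_b for y in b]
    covariance = sum(x * y for x, y in zip(fluct_a, fluct_b)) / n
    rms_a = math.sqrt(sum(x * x for x in fluct_a) / n)
    rms_b = math.sqrt(sum(y * y for y in fluct_b) / n)
    return covariance / (rms_a * rms_b)


# Made-up samples: theta follows u_z exactly, u_r opposes it exactly
u_z = [1.0, 2.0, 3.0, 4.0]
theta = [2.0, 4.0, 6.0, 8.0]
u_r = [4.0, 3.0, 2.0, 1.0]

r_uz_theta = correlation_coefficient(u_z, theta)
r_uz_ur = correlation_coefficient(u_z, u_r)
```

A value near +1, such as the 0.96 quoted near the wall, therefore means that the streamwise velocity and temperature fluctuations are almost proportional to each other there.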

The turbulent Prandtl number is defined as 



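$$
Pr_t = \frac{\overline{u_r'^+ u_z'^+}}{\overline{u_r'^+\theta'^+}}\;
\frac{\partial\Theta^+/\partial r^+}{\partial U_z^+/\partial r^+} \tag{6}
$$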



Note that the wall-asymptotic value of Pr_t is independent of the molecular Prandtl
number, as suggested by Antonia and Kim [14]. Kawamura [16] obtained channel
DNS data for various Prandtl numbers. The results indicated that Pr_t is close
to 1 in the vicinity of the wall. The present results in Fig. 14 also approach
unity near the wall and are in good agreement with the results of other DNS [15][16].
The working fluid here is air, with a Prandtl number of 0.71.

The budget of the turbulent kinetic energy is written as 



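$$
0 =
\underbrace{-\frac{1}{r^+}\frac{\partial}{\partial r^+}\left(r^+\,\overline{k'^+u_r'^+}\right)}_{\text{Turbulent diffusion}}
\;\underbrace{-\,\overline{u_r'^+u_z'^+}\,\frac{\partial U_z^+}{\partial r^+}}_{\text{Production}}
\;\underbrace{-\frac{1}{r^+}\frac{\partial}{\partial r^+}\left(r^+\,\overline{p'^+u_r'^+}\right)}_{\text{Pressure diffusion}}
\;\underbrace{+\frac{1}{r^+}\frac{\partial}{\partial r^+}\left(r^+\frac{\partial k^+}{\partial r^+}\right)}_{\text{Viscous diffusion}}
\;\underbrace{-\,\varepsilon^+}_{\text{Dissipation}}
$$

where the dissipation $\varepsilon^+$ is the sum of the mean squares of the fluctuating
velocity gradients in cylindrical coordinates, with terms such as
$\overline{(\partial u_r'^+/\partial z^+)^2}$ and
$\overline{\left(\frac{1}{r^+}\,\partial u_z'^+/\partial\phi\right)^2}$.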





The budget terms of the above equation are shown in Fig. 15. The production
and dissipation terms are dominant in the near-wall region. The peak location of the
production term moves slightly toward the wall with increasing Reynolds number.
The budget of the temperature variance k_θ = θ'+^2/2 is derived as



The terms of this budget are shown in Fig. 16. In the viscous sublayer,
molecular diffusion and dissipation are dominant and increase with
increasing Reynolds number.

The peak locations of the production term are almost the same for Re_τ =
360, 500 and 1050. Note that the asymptotic peak value of the turbulent production
is 0.25 in high-Reynolds-number flows. In Fig. 17, the result for Re_τ = 1050
indicates that the peak value of the turbulent production is close to 0.25
at y+ = 15.

Figure 18 shows contours of the low-speed streaky structures (u+ < -3.5)
and of the high-temperature regions (θ+ > 3.5), normalized by ν and u_τ.
The visualized volume (L_z+ = 985, y+ = R+ - r+ = 1050, r+φ = 536) for
Re_τ = 1050 is cut out of the full computational volume (L_z+ = 15750,
y+ = R+ - r+ = 1050, r+φ = 6597). The width of the large streaky structures is
larger than r+φ = 100, and they are located away from the wall. A few streaks merge
into "plate-like structures" at y+ = 200 from the wall. The velocity and scalar
fields show a strong analogy. Many tube-like structures and large streaky
structures exist in this volume. The second invariant of the velocity gradient tensor
(Q+ < 0.008), the ejections (u+v+ < 0, v+ > 0, v+ = -u_r+) and the sweeps
(u+v+ < 0, v+ < 0, v+ = -u_r+) for Re_τ = 1050, 500, 360 and 180 in the cross
section perpendicular to the circumferential direction are shown in Fig. 19 (a)-
(d), respectively. At the low Reynolds number (Re_τ = 180), the ejections and
sweeps occupy very small regions located around vortical structures.
At the higher Reynolds numbers (Re_τ = 360, 500, 1050), however, the structures
appear rather different from those at the low Reynolds number. The characteristic
sizes of the ejections and sweeps are even larger than half of the pipe radius.
Most of the large structures located at y+ > 200 correspond to the wake region of
the mean velocity profile. In this region, the scales of fluid motion differ
from those in the near-wall region.

5 Conclusion 

DNS of turbulent pipe flow with a heated wall was carried out for five Reynolds
numbers. The present results are in good agreement with previous experimental
data. The temperature variance and the scalar-flux budget terms at Re_τ
= 1050 are most energetic close to the wall. From the numerical visualization
results, it is found that the velocity streaks merge at y+ = 200. A more
detailed temperature visualization for Re_τ = 1050 will be reported in the
presentation.

Table 1. Computational condition

Fig. 1 Domain decomposition technique: (a) FFT, (b) TDMA

Fig. 2 Computational domain

Fig. 3 Skin friction

Fig. 4 Nusselt number

Fig. 5 Mean velocity profiles

Fig. 6 Reynolds shear stress and total stress

Fig. 7 Streamwise turbulent intensity

Fig. 10 Mean temperature

Fig. 11 Total and scalar flux

Fig. 12 Temperature fluctuation

Fig. 14 Turbulent Prandtl number

Fig. 15 Turbulent kinetic energy

Fig. 16 The budget of temperature variance

Fig. 17 The maximum value of turbulent kinetic energy

Fig. 18 Contours of the low-speed streaky structures (u+ < -3.5), blue, and of the
high-temperature structures (θ+ > 3.5), red

Fig. 19 The second invariant of the velocity gradient tensor (Q+ < 0.008), the ejections
(u+v+ < 0, v+ > 0, v+ = -u_r+) and the sweeps (u+v+ < 0, v+ < 0, v+ = -u_r+) for
Re_τ = 180, 360, 500, 1050 in the cross section perpendicular to the circumferential direction

Abstract. We have developed the Progressive Parallel Plasma (P3) 2D
code, tuned for the massively parallel scalar computer 'Intel Paragon
XP/S 75MP834'. The computer is composed of 834 nodes, and each
node has 1 communication CPU, 2 calculation CPUs and 128 Mbyte of
random access memory (RAM). In total, the computer has
100 GByte of RAM and a performance of 120 GFLOPS. The parallelization
scheme is domain decomposition, and information on particles crossing a
node's boundary is communicated to the 2 neighboring nodes. The
performance of the calculation for plasma particle simulations is 53
nanoseconds per particle per simulation time step, which corresponds to
effectively 42 GFLOPS. By using the P3 2D code, a simulation including
1 Giga particles can be completed within a few days.



1 Introduction 

The development of short-pulse ultra-high-intensity lasers [1] has opened a new regime
in the study of laser-plasma interaction. Depending on the type of matter and the laser
parameters, various photon generation and particle acceleration mechanisms have
been invoked in the different regimes of laser-plasma interaction. Recently, there
has been a great deal of research devoted to the generation of higher harmonics and
X-rays by ultra-high-intensity lasers. Strong ion acceleration such as the Coulomb
explosion [2] is associated with the breaking of plasma quasineutrality when the
electrons are expelled from a self-focusing radiation channel in the plasma, after
which the ions expand due to the repulsion of the uncompensated electrical charge [3].
The Coulomb explosion has also been invoked to describe the generation of fast
ions during the interaction of high-intensity laser pulses with clusters [4].

In the interpretation of the experimentally observed acceleration of the ions, it was
assumed in Ref. [5] that the ions move radially with respect to the channel under the
effect of the Coulomb explosion. However, the generation of fast ions not only
radially but also in the forward direction was observed in 2D Particle-In-Cell (PIC)
simulations [6]. In the case of an overdense plasma, the role of the channel is played
by the hole bored by the laser pulse. In such plasmas, in addition to the plasma
expansion into vacuum mentioned above and to the ion expulsion in the transverse
direction due to the self-focusing channel, we also notice ion acceleration in the
plasma resonance region [6] and forward ion acceleration if the laser pulse interacts
with a thin foil [7]. The latter results were obtained via PIC simulations in the
framework of a one-dimensional planar model, which is valid as long as the transverse
size of the laser pulse is much larger than the acceleration length. Since in a
one-dimensional planar model the electrostatic potential diverges as the width of the
ion cloud increases, we must perform at least two-dimensional simulations. In addition,
an ultraintense laser pulse in a near-critical-density plasma or in an overdense plasma
is subject to relativistic self-focusing, the description of which also requires at
least two-dimensional PIC simulations.

M. Valero et al. (Eds.): ISHPC 2000, LNCS 1940, pp. 524-534, 2000.

The numerical study of the interaction of an ultraintense laser with matter needs to
be performed with at least two-dimensional simulations. However, the computational
cost of such a simulation is very high. For example, for an electron density of
6.2×10^23 cm^-3, i.e. Al, the mesh size must be smaller than 2 nm because the skin
depth is 7 nm. Within a few tens of femtoseconds, ions and electrons expand to around
10 µm from the irradiation point of the matter. Therefore, a simulation with
10000 × 10000 = 0.1 G meshes is required, and 1 Giga particles must be contained in
the simulation box to investigate the properties of the accelerated ions and
electrons, i.e. their energy and spatial distributions.
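The estimate above can be reproduced from the collisionless skin depth c/ω_pe. The following sketch assumes an electron density of 6.2×10^23 cm^-3 (the exponent is inferred from the quoted 7 nm skin depth) and a box of 20 µm per side, i.e. 10 µm on either side of the irradiation point; both are assumptions made for illustration.

```python
import math

# Physical constants in SI units
E_CHARGE = 1.602176634e-19   # C
E_MASS = 9.1093837015e-31    # kg
EPS0 = 8.8541878128e-12      # F/m
C_LIGHT = 2.99792458e8       # m/s


def skin_depth(n_e_cm3):
    """Collisionless skin depth c / omega_pe for an electron density in cm^-3."""
    n_e = n_e_cm3 * 1e6  # convert to m^-3
    omega_pe = math.sqrt(n_e * E_CHARGE ** 2 / (EPS0 * E_MASS))
    return C_LIGHT / omega_pe


delta = skin_depth(6.2e23)        # assumed density, see lead-in
cell = 2e-9                       # mesh size quoted in the text (m)
box = 20e-6                       # assumption: 10 um on each side
cells_per_side = round(box / cell)
total_cells = cells_per_side ** 2
particles_per_cell = 1e9 / total_cells

print(f"skin depth = {delta * 1e9:.2f} nm")
print(f"mesh = {cells_per_side} x {cells_per_side}, "
      f"{particles_per_cell:.0f} particles per cell")
```

Under these assumptions the skin depth comes out just below 7 nm, and 1 Giga particles on a 10000 × 10000 mesh correspond to about ten particles per cell.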



2 Intel Paragon 75MP834 

Parallel calculations using more than a hundred processors have been performed in
Japan for several years. The Intel Paragon XP/S 75MP834 (Kansai Research
Establishment) and 15GP256 (Naka Fusion Research Establishment) were
introduced as pioneering machines at the Japan Atomic Energy Research Institute for
massively parallel calculations in advanced photon and fusion research. Recently,
many parallel programs have been ported or newly written to perform
parallel calculations on these computers. As a result, some sources of trouble have
appeared in massively parallel computing. When programs are developed under different
computers and operating systems, careful guidance and knowledge are needed.
However, integrating knowledge and standardizing the environment are quite
difficult because of the small number of Paragon systems. There are only a few codes
for massively parallel computing.

We have developed the P3 2D code, tuned for the massively parallel scalar
computer 'Intel Paragon XP/S 75MP834'. The computer is composed of 834 nodes,
and every node has 1 communication CPU, 2 calculation CPUs and 128 Mbyte of RAM.
The Paragon has 26 I/O nodes independent of the 834 calculation nodes. Each I/O node
has 64 Mbyte of RAM and an effective I/O rate of 2.6 Mbyte/s. Note that an I/O
node cannot accept more than 32 I/O requests. The parallel file system (pfs) is
built on 16 I/O nodes. In total, the computer has 100 GByte of
RAM and a performance of 120 GFLOPS. The parallelization scheme is domain
decomposition, and information on particles crossing a node's boundary is
communicated to the 2 neighboring nodes. The performance of the calculation for
plasma particle simulations is 53 nanoseconds per particle per simulation time step,
which corresponds to effectively 42 GFLOPS. By using the P3 2D code, a simulation
including 1 Giga particles can be completed within a few days.

Fig. 1 The parallelization scheme: one-dimensional domain decomposition and
communication with the 2 neighboring nodes



3 Progressive Parallel Plasma (P3) 2D Code

To perform simulations on a very large scale, several techniques are used in the P3
code. The first is a technique for using RAM effectively. The technique is called
"variable dimension", which is an old style of Fortran programming. As shown in
Program 1 and Fig. 2, a large array is defined in the main program and is divided into
many pieces of variable size in each subroutine. The method has several advantages for
RAM management, for example, saving the required RAM and giving a systematic
view of how RAM is used. With this technique, the feasible scale of calculation is
improved by up to 300%, and a 1 Giga-particle simulation has been realized.

      dimension AT(1:4000)
      ip1=1
      ip2=1000
      ip3=2000
      ip4=3000
      ip5=2000
      ip6=2500
      ip7=3200
      ip8=3500
c sub1 and sub2 share AT(ip1) and AT(ip2); their work areas overlap
      call sub1(AT(ip1),AT(ip2),AT(ip3),AT(ip4))
      call sub2(AT(ip1),AT(ip2),AT(ip5),AT(ip6),AT(ip7))
      stop
      end

      subroutine sub1(FM1,FM2,TM3,TM4)
      dimension FM1(*),FM2(*),TM3(*),TM4(*)
c     ...
      return
      end

      subroutine sub2(FM1,FM2,TM5,TM6,TM7)
      dimension FM1(*),FM2(*),TM5(*),TM6(*),TM7(*)
c     ...
      return
      end

Program 1 Method of the variable dimension
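The idea of Program 1, one large work array carved into pieces whose sizes can differ from one subroutine to the next, can be imitated in Python with `memoryview` slices, which share storage with the underlying array. This is only an analogy to the Fortran technique; the offsets mirror `ip1` to `ip7` shifted to zero-based indexing, and the bodies of `sub1` and `sub2` are placeholders.

```python
from array import array

# One large work array, analogous to AT(1:4000) in Program 1
work = array("d", [0.0]) * 4000
view = memoryview(work)

# Zero-based offsets mirroring ip1..ip7 (Fortran index minus 1)
ip1, ip2, ip3, ip4 = 0, 999, 1999, 2999
ip5, ip6, ip7 = 1999, 2499, 3199


def sub1(fm1, fm2, tm3, tm4):
    """Placeholder body: writes into its two temporary pieces."""
    tm3[0] = 1.0
    tm4[0] = 2.0


def sub2(fm1, fm2, tm5, tm6, tm7):
    """Placeholder body: tm5 reuses the storage that sub1 used for tm3."""
    tm5[0] = 5.0
    tm7[0] = 7.0


# Slices are views, so no data are copied and no extra memory is allocated
sub1(view[ip1:ip2], view[ip2:ip3], view[ip3:ip4], view[ip4:])
sub2(view[ip1:ip2], view[ip2:ip5], view[ip5:ip6], view[ip6:ip7], view[ip7:])
```

The temporary pieces of `sub2` start inside the region that `sub1` used for its own temporaries, which is how the total memory footprint stays at the size of the single array.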



[Fig. 2: division of the array AT into pieces for sub1 and sub2]




The second is a technique to prevent cache misses. In previous large-scale
simulations, a programming style suited to vector processors was adopted. Because the
Paragon consists of a massive number of scalar processors, the programming style is
changed to one suited to a scalar processor, as follows.

Vector programming
      do j = 1,Nmax
        kx = x(j)
        ky = y(j)
        ux = Px(j)
        uy = Py(j)
        uz = Pz(j)
      enddo

Scalar programming
      do j = 1,Nmax
        kx = phase(1,j)
        ky = phase(2,j)
        ux = phase(3,j)
        uy = phase(4,j)
        uz = phase(5,j)
      enddo

Program 2 Comparison of programs for vector and scalar computers



With this technique, the calculation speed is improved by up to 45%.
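The reason the second layout is friendlier to a cache can be seen from the memory offsets of the five components of one particle. The sketch below assumes, purely for illustration, that the five separate arrays of the vector version are stored one after another, and it uses the column-major ordering of the Fortran array `phase(5,Nmax)`; the function names are hypothetical.

```python
N_COMP = 5  # x, y, Px, Py, Pz


def offset_separate_arrays(comp, j, nmax):
    """Offset of component comp of particle j when each component is its
    own array of length nmax (assumed to be laid out end to end)."""
    return comp * nmax + j


def offset_phase_array(comp, j):
    """Offset in a column-major phase(5, Nmax) array, zero-based."""
    return comp + N_COMP * j


nmax = 1_000_000
j = 12345
separate = [offset_separate_arrays(c, j, nmax) for c in range(N_COMP)]
packed = [offset_phase_array(c, j) for c in range(N_COMP)]

span_separate = max(separate) - min(separate)
span_packed = max(packed) - min(packed)
print(span_separate, span_packed)
```

In the packed layout the five values of a particle sit in adjacent words and usually arrive with one or two cache lines, whereas in the separate layout each component comes from a different region of memory.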

The third is a technique to use the two processors of a single node in parallel. As
shown in Program 3, the calculations on the two processors, and especially their
accesses to RAM, should be independent of each other. This technique is specific to
the Paragon MP series, although it may also be used on other computers in which
multiple processors share RAM. The sample program runs about twice as fast as
on a single processor. With this technique, the calculation speed is improved by up to 45%.

c directive for two processors in the single node
cdir$l cncall
      do iprocessor = 1,2
        if (iprocessor.eq.1) then
c calculation of Phase(1)-Phase(Ndiv-1) at processor 1
          call particle1(Phase(1),J1)
        else
c calculation of Phase(Ndiv)-Phase(Nmax) at processor 2
          call particle2(Phase(Ndiv),J2)
        end if
      enddo
c cancel directive for two processors in the single node
cdir$l nocncall
c sum up J1=J1+J2
      call sumup(J1,J2)

Program 3 Parallelization for two processors in the single node
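The structure of Program 3 (split the particles into two independent halves, process each half on its own processor, then add the two partial currents) can be sketched in Python as follows. The function `deposit_current` is a hypothetical stand-in for `particle1` and `particle2`, and the particle data are made up. Python threads do not give a real speed-up for this kind of arithmetic, so the sketch only illustrates the data independence that the technique relies on.

```python
from concurrent.futures import ThreadPoolExecutor


def deposit_current(particles):
    """Hypothetical stand-in for particle1/particle2: partial current sum
    over (charge, velocity) pairs, touching only its own slice of data."""
    return sum(q * v for q, v in particles)


def push_on_two_processors(phase, ndiv):
    """Split the particle list at ndiv, process both halves concurrently,
    then add the two partial currents (the role of sumup in Program 3)."""
    halves = (phase[:ndiv], phase[ndiv:])
    with ThreadPoolExecutor(max_workers=2) as pool:
        j1, j2 = pool.map(deposit_current, halves)
    return j1 + j2


# Made-up (charge, velocity) pairs
phase = [(1.0, 0.5), (-1.0, 0.25), (1.0, -0.5), (-1.0, 1.0)]
total_current = push_on_two_processors(phase, ndiv=2)
```

Each worker writes only to its own partial result, so no locking is needed until the final sum.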

The fourth is a technique to parallelize I/O from a large number of calculation nodes.
The technique is based on the NX library [8,9], which is provided by Intel for parallel
computing. The Paragon XP/S 75MP834 has 834 calculation nodes and 26 I/O nodes.
Each I/O node has 64 Mbyte of RAM and a speed of 2.6 Mbyte/s. An I/O node cannot
accept more than 32 I/O requests. The parallel file system (pfs) is built on 16
I/O nodes. An I/O speed of 40 Mbyte/s is achieved with the pfs striped over 16 I/O
nodes and parallel writing from 800 nodes by the following program.

c irw=(number of nodes)/(number of I/O requests)
      irw=800/16
c gopen(logical unit, file path, option)
      call gopen(id,file_path,M_RECORD)
      do i = 1, irw
c control of 16 I/O requests in the I/O node
        if (mod(nodenumber,irw).eq.i-1) then
c cwrite(logical unit, output variable, output data size)
          call cwrite(id,Phase,idatasize)
        end if
c barrier
        call gsync()
      enddo

Program 4 Parallelization of I/O on a massive number of nodes
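The loop in Program 4 staggers the writes so that only 16 nodes issue an I/O request in any one round. The following Python sketch reproduces just the scheduling logic, with no actual I/O, using the numbers from the text (800 nodes, 16 simultaneous requests); the function name `write_schedule` is illustrative.

```python
def write_schedule(n_nodes=800, n_concurrent=16):
    """Return, for each round, the list of node numbers allowed to write.

    Mirrors Program 4: irw = n_nodes / n_concurrent rounds, and node k
    writes in round i when mod(k, irw) == i - 1.
    """
    irw = n_nodes // n_concurrent
    rounds = []
    for i in range(1, irw + 1):
        writers = [node for node in range(n_nodes) if node % irw == i - 1]
        rounds.append(writers)
        # In the real code a global barrier (gsync) separates the rounds
    return rounds


rounds = write_schedule()
print(len(rounds), "rounds,", len(rounds[0]), "writers per round")
```

Every node writes exactly once, and the barrier after each round keeps the number of outstanding requests at 16 in total, well below the limit of 32 per I/O node.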

In an ordinary PIC simulation, a correction of the electric field by the fast Fourier
transform (FFT) is needed to satisfy Coulomb's law in Maxwell's equations.
However, the cost in RAM size and communication traffic is large when the
FFT is used. In the code, a rigid local solver [10] is adopted to overcome this
difficulty.

As a result, a calculation speed of 53 ns/step/particle is achieved on 800 nodes, and
the speed-up, as an index of parallel computing performance, is about 800. This
corresponds to 42 GFLOPS, which is 35% of the maximum hardware
specification of 120 GFLOPS. For comparison, the calculation speeds on an NEC SX-4
(16 CPUs) and a Fujitsu VPP300 (1 CPU) are 340 ns/step/particle and
4660 ns/step/particle, respectively.
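The performance figures above are related by simple arithmetic, shown in the sketch below. The run length of three days is an assumption standing in for "a few days", and the variable names are illustrative.

```python
TIME_PER_PARTICLE_STEP = 53e-9   # s, quoted for 800 nodes
SUSTAINED_FLOPS = 42e9           # quoted effective performance
PEAK_FLOPS = 120e9               # quoted hardware maximum
N_PARTICLES = 1e9                # 1 Giga particles

# Floating-point operations implied per particle per time step
flops_per_particle_step = TIME_PER_PARTICLE_STEP * SUSTAINED_FLOPS
fraction_of_peak = SUSTAINED_FLOPS / PEAK_FLOPS

# Wall-clock time of one step and steps possible in an assumed 3-day run
seconds_per_step = N_PARTICLES * TIME_PER_PARTICLE_STEP
steps_in_three_days = 3 * 24 * 3600 / seconds_per_step

# Speed relative to the other machines quoted in the text
ratio_sx4 = 340 / 53
ratio_vpp300 = 4660 / 53

print(f"{flops_per_particle_step:.0f} flop/particle/step, "
      f"{fraction_of_peak:.0%} of peak")
print(f"{seconds_per_step:.0f} s per step, "
      f"about {steps_in_three_days:.0f} steps in three days")
print(f"{ratio_sx4:.1f}x the SX-4 (16 CPUs), {ratio_vpp300:.0f}x the VPP300 (1 CPU)")
```

One time step with 1 Giga particles thus takes a little under a minute, so a run of a few days covers several thousand time steps.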

To simulate the interaction of a real solid with a relativistic laser, ionization,
collision and radiation processes need to be calculated. In the code, ionization and
collision processes are simulated by the Monte Carlo method. Bremsstrahlung and
Larmor radiation are estimated in post-processing.




Fig. 3 The flow diagram of the P3 2D code

In large-scale simulation, a real-time visualization and steering system is regarded
as a promising method of data analysis. This approach is valid when the analysis is
fixed in advance and done once. In simulation research on an unknown problem, however,
the output data must be analyzed many times, because a fruitful analysis is difficult
at the first attempt. Consequently, the output data should be stored in files so that
they can be consulted and analyzed at any time. A pseudo-real-time visualization system
is therefore built into the code. The support system has the following automatic
functions:

1) make directories on the Paragon, the file server and the graphics workstation

2) transfer files from the Paragon to the file server

3) create CLI or V scripts for AVS5 or AVS/Express

4) execute AVS

5) convert the image formats and rearrange the images




Fig. 4 Illustration of the P3 support system

In future work, we are going to add the following functions to the P3 support
system:

1) automatic backup and restoration of files

2) a link to a database server to manage data and image files

4 Higher Harmonic Generation in the Interaction with Thin Solid Hydrogen
Using the Code and Its Support System



The interaction of a laser with thin solid hydrogen has been studied with the code and
the support system. The simulation has been performed for low temperature
(10 to 100 eV) and with observation of higher harmonics (keV X-rays). The laser
pulse is Gaussian along the x and y axes with a full width at half maximum of 4 µm.
The ion density corresponds to real solid hydrogen, 6.2*10^^ cm'^. The laser condition
is set as shown in Fig. 5. Figs. 6 and 7 are the results of the 1 G particle simulation
of the interaction with a relativistic laser of a0 = 10 with the P3 2D code. Figs. 6 and 7
show the time variation of the electric field in the polarization direction and the
spectrum (kx, ky) of the light transmitted behind the thin foil, respectively. In
Fig. 5, the surface and depth directions of the foil are vertical and horizontal,
respectively.

Fig. 5 Illustration of the geometry for the simulation with hydrogen.

Fig. 6 shows that higher harmonics (~10ω_L) corresponding to the plasma
frequency are generated strongly through the thin solid target. The higher
harmonic field is quite strong, ten times as large as that of the
laser. As shown in Fig. 7, higher harmonics below the plasma frequency are strongly
suppressed in the transmitted light. Only higher harmonics above twice the
plasma frequency are transmitted from the back of the foil. Apart from emissions
from atomic processes induced by the intense laser, the radiation can become a short
and intense pulse because it arises directly from the interaction with the laser. The
pulse length of the radiation is of the same order as that of the laser. The result
indicates that the higher harmonics are generated in front of the foil. In fact, the
spectrum of the reflected light in front of the foil does not show this suppression.

Fig. 6 Time evolution of the electric field (Ey) in the polarization direction.
The interval between the snapshots is 12 femtoseconds, and the snapshot on the right
side is at the end of the simulation.



5 Conclusions 

We have developed the P3 2D code and its support system, tuned for the massively
parallel scalar computer 'Intel Paragon XP/S 75MP834'. The parallelization scheme is
one-dimensional domain decomposition, and information on particles crossing a
node's boundary is communicated between the 2 neighboring nodes. The performance of
the calculation for plasma particle simulations is 53 nanoseconds per particle per
simulation time step, which corresponds to effectively 42 GFLOPS. By using the P3 2D
code, a simulation including 1 Giga particles can be completed within a few days. In
future work, we are going to add the following functions to the P3 support system:

1) automatic backup and restoration of files

2) a link to a database server to manage data and image files

Fig. 7 Two-dimensional wave number spectra of the transmitted and reflected light.

With the P3 2D code and its support system, it is found that only higher harmonics
above twice the plasma frequency are transmitted from the back of the foil.
The higher harmonics below the plasma frequency are strongly suppressed in the
transmitted light. The result indicates that the higher harmonics are generated in
front of the foil. Apart from emissions from atomic processes induced by the intense
laser, the radiation can become a short and intense pulse. The pulse length of the
radiation is of the same order as that of the laser.

Abstract. The use of three-dimensional PIC (Particle-in-Cell)
simulation is indispensable in studies of nonlinear plasma physics,
such as ultra-intense laser interactions with plasmas. A three-
dimensional simulation requires a large number of particles, more than
10^ particles. It is therefore very important to develop a parallelization
and vectorization scheme for the PIC code and a visualization method for
huge simulation data. In this paper we present a new parallelization
scheme suitable for present-day supercomputers and a method of
constructing scientific color animations to analyze simulation data. We
also discuss the advantage of the Abe-Nishihara vectorization method
for large-scale PIC simulations.

Most present-day supercomputers consist of multiple nodes, and
each node has multiple processors with a shared memory. We have
developed a new parallelization scheme in which domain
decomposition is applied among nodes and particle decomposition is
used for the processors within a node. Domain decomposition in PIC
requires the exchange of two kinds of data between neighboring
domains. One is particle data, such as particle position and velocity,
when a particle crosses the boundary between neighboring domains.
The other is field data, such as the electric field intensity and current
density in the boundary region. In the three-dimensional electromagnetic
PIC, forty two-dimensional variables are transferred to the
neighboring domain for each boundary surface. MPI (Message Passing
Interface) has been used for the transmission of these data between the
nodes. The particle and field data are each first packed into a one-
dimensional array, which is then sent to the other node. This reduces
the number of communications. The particle decomposition is performed
by auto-parallelization of do-loops. We measured the scalability of
the layered parallelization scheme for 25,600,000 particles
and a mesh of 128x128x128 with the use of sixteen
processors of the NEC SX-5. The layered parallelization is shown to
provide a scalable acceleration of computation for large
PIC simulations.

In the PIC code it is necessary to calculate the current and charge
densities from the particle positions and velocities. For example, the
calculation of the charge density on a mesh requires the assignment of
each particle's charge to a mesh point according to its position. Since two or
more particles may lie in the same mesh cell, the calculation of the assignment
cannot be vectorized as it stands. However, for a large PIC simulation the
probability that two or more particles within one vector register lie in
the same mesh cell becomes very small. Therefore, even if the vector
operation is forced for the calculation of the charge assignment,
corrections to the calculation are seldom needed. The particles
whose charge is overwritten by the forced vector operation can be easily
found by checking the stored particle numbers. This is the Abe-
Nishihara method. We obtained 99% vectorization even for the list
vector calculation in PIC by using the Abe-Nishihara method. As a result,
0.73 sec per cycle is achieved for 2.56x10^7 particles and
(128)^3 meshes by using sixteen processors of the NEC SX-5. To
our knowledge, this is the fastest speed achieved at present for this size
of program.
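The Abe-Nishihara idea described above can be illustrated with a small serial emulation of a forced vector scatter. The sketch below is an interpretation of the description in the text, not the authors' code: it uses nearest-grid-point assignment in one dimension, emulates a vector register of length `vlen` with ordinary loops, and the names `deposit_forced_vector` and `deposit_serial` are hypothetical.

```python
def deposit_forced_vector(cells, charges, n_cells, vlen=8):
    """Charge assignment with an emulated forced vector scatter.

    Within each chunk of vlen particles, all old densities are gathered
    first and then scattered back, so particles sharing a cell overwrite
    one another. The particle number stored last in each cell reveals
    which particles were overwritten; only those are corrected.
    """
    rho = [0.0] * n_cells
    owner = [-1] * n_cells
    for start in range(0, len(cells), vlen):
        lanes = range(start, min(start + vlen, len(cells)))
        # "Vector" gather of the densities before the update
        gathered = [rho[cells[p]] for p in lanes]
        # "Vector" scatter: on a conflict the last lane wins
        for k, p in enumerate(lanes):
            rho[cells[p]] = gathered[k] + charges[p]
            owner[cells[p]] = p
        # Check the stored particle numbers and correct the losers
        for p in lanes:
            if owner[cells[p]] != p:
                rho[cells[p]] += charges[p]
    return rho


def deposit_serial(cells, charges, n_cells):
    """Reference: plain scalar accumulation."""
    rho = [0.0] * n_cells
    for c, q in zip(cells, charges):
        rho[c] += q
    return rho


# Made-up example with several particles in the same cell
cells = [0, 3, 3, 1, 3, 2, 0, 1]
charges = [1.0] * len(cells)
rho_vector = deposit_forced_vector(cells, charges, n_cells=4)
rho_serial = deposit_serial(cells, charges, n_cells=4)
```

In a large simulation conflicts inside one vector register are rare, so the correction loop touches very few particles and almost all of the work stays in the vectorizable gather and scatter steps.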

Most advanced visualization software requires interactive
operations, which are not suitable for drawing the many pictures
needed to make scientific animations from simulation data. It is
desirable for the pictures to be drawn by a batch job. We have developed a
program in which geometry objects are constructed with the AVS library
while the computer simulation is running. Digital video files
are constructed from the geometry objects by a batch job, using the script
function of the AVS animation encoder on SGI.

For modern high-power lasers, the intensity of the laser radiation can be
of the order of 10*^ - 10^^ W/cm^. In this ultra-relativistic regime, where
the quiver energy of an electron in the laser light becomes much greater
than the electron rest mass, novel nonlinear physics comes within our
grasp. In this paper we present our recent studies on ultra-intense
laser interaction with overdense plasmas with the use of a three-
dimensional PIC code, EMPAC-3D (Three-Dimensional Electro-
Magnetic Particle Code). The three-dimensional regime of the laser-
plasma interaction reveals novel features of the laser light.

Abstract. Although texture-based methods provide a very promising
way to visualize 3D vector fields, they are very time-consuming. In this
paper, we introduce the notion of a "significance map", and describe how
significance values are derived from the intrinsic properties of a vector
field. Based on the significance map, we propose techniques to
accelerate the generation of a line integral convolution (LIC) texture
image, to highlight important structures in a vector field, and to
generate an LIC texture image with different granularities. We also
describe how to implement our method in a parallel environment.
Experimental results illustrate the feasibility of our method.



1 Introduction 

Visualizing 3D vector data fields in an intuitive and psychologically meaningful
manner is a very challenging topic in scientific visualization. Traditionally, flow
fields are displayed by inserting geometrical primitives such as arrows,
particles, streamlines or stream surfaces. Due to inadequate sampling of the vector
field, this often either results in cluttered images or fails to capture significant
features of the flow. Recently, Line Integral Convolution (LIC), which was proposed by
Cabral and Leedom in 1993 [1], has been attracting much attention as a powerful
texture-based vector field visualization method. Since a texture possesses shape
information as well as color attributes, it has great potential for visualizing 3D
vector fields. The texture deformation provides a straightforward cue for the direction
of a vector field. Besides, since a texture is calculated at each pixel, the traditional
sampling problem in flow visualization can be avoided.

LIC is a procedure that smears a given image along paths that are dictated by a 
vector field [1]. It is local, one-dimensional and independent of any predefined 
geometry or texture, and is capable of showing the vector directions even in the area 
where they change quickly. Much work has been done on extending its scope, 
usefulness, quality and efficiency. Stalling and Hege [2] succeeded in a fast 
implementation of LIC. Their algorithm also allows the resolution of output images to 
be chosen independently of the size of vector fields. Forssell and Cohen [3] extended 
LIC for visualizing the flow on a 3D curvilinear grid. However, this method suffers 
from texture distortions caused by the non-isometric mappings between the 
computational space and the physical space. Mao, et al. [4] presented a method for 
performing the convolution operation directly in the physical space based on solid 
texturing and ray casting in order to generate LIC images without any artifacts due to 
misaligned local texture grids or the image interpolation during rendering. Shen, 
Johnson and Ma combined dye advection with 3D LIC to visualize global and local 
flow features at the same time [5]. Shen and Kao presented a new LIC 
method — UFLIC [6] for visualizing unsteady flows. By adopting a time-accurate 
value depositing scheme and a successive feed-forward method, UFLIC can produce 
highly coherent animation frames and trace the dynamic flow movement accurately. 
Okada and Kao [7] presented an enhanced LIC method, which uses post-filtering 
techniques to sharpen the LIC output and highlight flow features. Kiu and Banks [8] 
used multi-frequency noise inputs for LIC to enhance the contrasts among regions 
with different velocity magnitudes. Wegenkittl, Gröller and Purgathofer [9] introduced 
a very expressive technique, which makes use of an asymmetric filter kernel with a 
low-frequency noise input texture rather than the typical high-frequency one to 
enhance the perception of the orientation of vector fields. Verma, Kao and Pang [10] 
presented the PLIC method, in which flow visualizations that span the spectrum from 
streamline-like to LIC-like images can be generated by adjusting a small set of key 
parameters. 

Most of the existing texture-based vector field visualization methods treat every 
pixel equally. This implies that details are uniformly distributed throughout the 
texture space, thus leading to a fixed detail over the entire texture space without any 
designated highlights. In fact, it is quite common for a flow field to have an extremely 
heterogeneous distribution of details. Since each pixel has the same significance value 
in the previous methods, it is very time-consuming to generate a finer image if we 
want to visualize significant structures with enough precision. For the area containing 
less detail of a vector field, we need not take much time to calculate its texture very 
carefully. In this paper, we present a significance-driven texture-based vector field 
visualization method to ameliorate the problem. We introduce the “significance map”, 
which is derived from both intrinsic properties of a vector field and user-guided 
highlights. We describe how the significance map is used to improve the generation 
of texture images, including accelerating the texture image generation, highlighting 
significant structures, and generating multi-granularity texture images. Finally, 
experimental results are given to illustrate the feasibility of our method. 



2 Preliminaries 

As a preliminary to the description of the new algorithm in the succeeding sections, 
we give here an overview of the original LIC algorithm presented by Cabral and 
Leedom [1]. 

Given a vector field, the LIC algorithm takes as input a white noise image of the 
same size as the vector field, and convolves the image at each pixel with a 1D filter 
kernel defined along the local streamline in the vector field. As shown in Fig. 1, for a 
pixel P(x, y), the local streamline is calculated by integrating forward and backward 
along the local vector direction. 

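The forward integration is performed in the following way, as defined in [1]: 

$$ P_0 = (x + 0.5,\; y + 0.5), \qquad P_i = P_{i-1} + \frac{V(\lfloor P_{i-1} \rfloor)}{\| V(\lfloor P_{i-1} \rfloor) \|} \, \Delta s_{i-1} \qquad (1) $$

where V(\lfloor P \rfloor) is the vector at the grid cell containing P, and \Delta s_{i-1} is the 
distance travelled along that vector from P_{i-1} to the nearest cell boundary. 

Fig. 1. Local streamline for a point P(x, y) in a 2D vector field 

The backward streamline is calculated simply by taking the opposite direction of the 
vector at each point. Now the 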
output pixel value at the pixel P(x, y) can be represented as follows: 



$$ F_{out}(P) = \frac{\sum_{i=-l'}^{l} F_{in}(\lfloor P_i \rfloor) \, h_i}{\sum_{i=-l'}^{l} h_i}, \qquad h_i = \int_{s_i}^{s_i + \Delta s_i} k(w) \, dw \qquad (2) $$

where s_0 = 0, s_i = s_{i-1} + \Delta s_{i-1}, 

k(w): the convolution kernel, 

F_{in}(\lfloor P_i \rfloor): the input pixel value at pixel \lfloor P_i \rfloor, 

l, l': the numbers of pixels which the streamline passes through. 

Note that the local streamline generated here depends only on the direction of vectors, 
but ignores the magnitude. More accurate streamline calculation is given in [2]. 
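
As a rough illustration of the convolution in Eq. (2), the following Python sketch computes an LIC image on a small grid. It is not the implementation of [1]: it assumes a box kernel (constant k, so the weighted sum reduces to a plain average), and it advances the streamline by a fixed unit step instead of the exact distance to the next cell boundary. The function name `lic` and the nested-list data layout are hypothetical choices made for this example.

```python
import math


def lic(vx, vy, noise, half_len=10):
    """Box-kernel LIC on a 2D grid; only the vector direction is used."""
    h, w = len(noise), len(noise[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = noise[y][x], 1
            for sign in (1.0, -1.0):  # forward, then backward integration
                px, py = x + 0.5, y + 0.5  # start at the pixel center
                for _ in range(half_len):
                    ix, iy = int(px), int(py)
                    mag = math.hypot(vx[iy][ix], vy[iy][ix])
                    if mag == 0.0:  # critical point: the streamline stops
                        break
                    px += sign * vx[iy][ix] / mag
                    py += sign * vy[iy][ix] / mag
                    if not (0 <= px < w and 0 <= py < h):
                        break  # the streamline left the image
                    total += noise[int(py)][int(px)]
                    count += 1
            out[y][x] = total / count
    return out
```

With a purely horizontal field, every streamline stays in its own row, so a noise image that is constant along rows is returned unchanged; this is a quick sanity check that smearing happens only along the flow direction.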



3 Significance Specification 

3.1 Significance Values Derived from Topology Analysis of Flow Field 

In the original LIC method, the selection of the convolution length for each pixel is 
crucial. If a long length is selected, much more calculation is needed. But if it is too 
short, the vector direction cannot be revealed. In most of the previous papers, the total 
convolution length is selected to be proportional to the vector magnitude at each 
pixel, that is, length = k|v|, where |v| is the magnitude of vector v at a pixel. However, 
no convincing answer has been given on how to select the coefficient k. In order to 
obtain a good result everywhere, a long integration length is usually adopted. In our 
opinion, it is better to determine the integration length according to the relative 
significance coefficient of each region in the vector field. We regard a region that 
contains interesting structures, such as a vortex, as a significant area, and take its 
integration length in the original way to get a finer texture. On the other hand, we 
regard a region that contains little interesting detail as an insignificant area, where 
we can shorten the convolution length considerably to accelerate the image generation. 
Although the flow direction may not be very clear in that area of the final image, this 
has much less influence on the quality of the entire image because users pay 
much less attention to the area. Furthermore, in the insignificant area, we need not 
calculate a texture at each pixel. We can calculate a smaller number of pixels to get an 
image with coarser texture granularity, which also accelerates the LIC method. 
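
The idea of shortening the convolution length in insignificant areas can be sketched as follows. The paper does not give the mapping from significance to length, so this sketch assumes a significance coefficient in [0, 1] and a simple linear interpolation between a minimum length and the full length; `convolution_length`, `relative_cost` and `min_length` are hypothetical names introduced only for illustration.

```python
def convolution_length(significance, full_length, min_length=2):
    """Interpolate linearly between min_length and full_length (an assumption)."""
    s = min(1.0, max(0.0, significance))  # clamp to [0, 1]
    return max(min_length, round(min_length + s * (full_length - min_length)))


def relative_cost(significances, full_length, min_length=2):
    """Integration steps needed, relative to using full_length at every pixel."""
    used = sum(convolution_length(s, full_length, min_length) for s in significances)
    return used / (full_length * len(significances))
```

For example, if only one pixel in four is fully significant and the rest have significance zero, `relative_cost([1.0, 0.0, 0.0, 0.0], 40)` gives the fraction of integration steps that remain under these assumptions.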

In order to determine the significance value at each point in a given flow field, we 
employ the vector field topology analysis technique developed by Chong, et al. [11]. 
Topological concepts are very powerful in the analysis of flow fields. They are based 
on the critical point theory, which has been used widely to examine solution 
trajectories of ordinary differential equations. The topology of a vector field consists 
of critical points (where the velocity vector is zero), and integral curves and surfaces 
connecting these critical points. From this, we can infer the shape of other tangent 
curves, and hence to some extent the overall structure of a given vector field. 

The positions of the critical points can be found by searching all cells in the flow 
field. They occur only in cells where all three components of the vector pass 
through zero. The exact position of a critical point can be calculated by interpolation 
in case of a rectangular grid. The position of a critical point in case of a curvilinear 
grid can be calculated by recursively subdividing the cells or by a numerical method 
such as the Newton iteration. 

Once the critical points have been found, they can be classified by approximating 
the velocity field in their neighborhood with the first-order Taylor expansion. Thus, 
for a non-degenerate critical point (x_0, y_0, z_0), we can use the matrix of these 
derivatives, that is, the Jacobian matrix, to characterize the vector field and the 
behavior of nearby tangent curves [12]: 

$$ J = \left. \frac{\partial (u, v, w)}{\partial (x, y, z)} \right|_{(x_0, y_0, z_0)} = \begin{pmatrix} \partial u / \partial x & \partial u / \partial y & \partial u / \partial z \\ \partial v / \partial x & \partial v / \partial y & \partial v / \partial z \\ \partial w / \partial x & \partial w / \partial y & \partial w / \partial z \end{pmatrix} $$

where (u, v, w) denote the components of the velocity vector. 

The eigenvalues and eigenvectors of this matrix determine the feature of the flow 
field near the critical point (x_0, y_0, z_0). Positive eigenvalues correspond to velocities 
away from the critical point (repelling nodes), and negative eigenvalues 
correspond to velocities towards the critical point (attracting nodes). 
Complex eigenvalues result in a focus. If the real part is non-zero, a spiral occurs 
(shown in Fig. 2(a), (d)), whereas if the real part is zero, concentric ellipses occur 
(shown in Fig. 2(f)). If we have both negative and positive real values at a critical 
point, the critical point is a saddle (shown in Fig. 2(c)). 





Fig. 2. Examples of three-dimensional critical points: (a) repelling focus, also repelling in the 
third dimension; (b) repelling node; (c) saddle, repelling in the third dimension; (d) attracting 
focus, repelling in the third dimension; (e) attracting node; (f) center, repelling in the third 
dimension 
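
The eigenvalue rules above can be made concrete with a small sketch. To keep it short, it is restricted to a 2x2 Jacobian, as one would obtain on a 2D cross-section; the function name, the tolerance `eps` and the returned labels are assumptions of this example rather than part of the original method. For a 2x2 matrix the eigenvalues are (tr ± sqrt(tr² − 4 det)) / 2, so the classification needs only the trace and the determinant.

```python
def classify_critical_point(j11, j12, j21, j22, eps=1e-12):
    """Classify a 2D critical point from its Jacobian [[j11, j12], [j21, j22]]."""
    trace = j11 + j22
    det = j11 * j22 - j12 * j21
    disc = trace * trace - 4.0 * det  # discriminant of the characteristic polynomial
    if abs(det) < eps:
        return "degenerate"  # a zero eigenvalue: first-order analysis is inconclusive
    if disc < 0.0:  # complex eigenvalues, real part = trace / 2
        if abs(trace) < eps:
            return "center"
        return "repelling focus" if trace > 0.0 else "attracting focus"
    if det < 0.0:  # real eigenvalues with opposite signs
        return "saddle"
    return "repelling node" if trace > 0.0 else "attracting node"
```

For instance, the rotation-like Jacobian [[0, -1], [1, 0]] has purely imaginary eigenvalues and is reported as a center, while [[1, 0], [0, -1]] is a saddle.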



However, the flow topology analysis sometimes cannot locate all flow features in 
3D data sets. The vortex core is known as a very important feature in flow fields, but to 
date it has no formal definition. Some researchers have attempted to define it as the 
center of swirling flows. Obviously, some center points of swirling flows are not 
located at critical points because they have a velocity in the principal direction of the 
swirling field. Much work has been done on finding vortex lines [13][14][15]. Sujudi 
and Haimes [13] defined the vortex core line as the set of places where what they call the 
reduced velocity is zero. The reduced velocity is defined as the velocity minus its 
component in the direction of the real eigenvector. In this method, only the regions 
having complex eigenvalues are considered, so there is always a single real 
eigenvalue, and the corresponding eigenvector direction is unique. But this method 
has problems with vortices that are curved. It succeeds only for very strong vortices or 
almost straight ones. Roth and Peikert [15] improved this method by using higher-order 
derivatives to make it possible to handle the case of bent vortices. 

Based on the above, searching for critical points directly in the 3D data set 
will miss some vortex cores. Therefore, we take the approach of finding the critical 
points on 2D cross-sections, and then classifying them by the flow topology analysis 
technique. This may exclude some points that have two zero components on the 
cross-sections without the spiral property. Obviously, if some cross-sections of a flow 
field contain critical points whose Jacobian matrices have two complex eigenvalues, 
those critical points must be very significant on the cross-sections of the flow field. 
Therefore, the areas near these critical points should contain much more detail than 
others, and should be assigned relatively higher significance values. The areas far away 
from all critical points are assigned relatively lower significance values. 

The steps for constructing the significance map according to the topology analysis 
are as follows. First, for each cross-section, find all the critical points. Then, classify 
these critical points. Select the repelling focus, attracting focus and concentric critical 
points, and store them in an array criticalpoint_on_section. Finally, for each cell, 
calculate the sum of distances from the cell to all points in the array 
criticalpoint_on_section, and map the sum to a significance value. 
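
The last step can be sketched as below. The text does not specify how the sum of distances is mapped to a significance value, so this example assumes a linear mapping in which the smallest sum on the cross-section receives significance 1 and the largest receives 0. The function name and the convention that cells are addressed by integer (x, y) coordinates are likewise assumptions; `criticalpoint_on_section` is taken from the text.

```python
import math


def significance_map(width, height, criticalpoint_on_section):
    """Map the per-cell sum of distances to focus/center points into [0, 1]."""
    if not criticalpoint_on_section:
        # No significant critical point on this section: uniformly low significance.
        return [[0.0] * width for _ in range(height)]
    sums = [
        [sum(math.hypot(x - cx, y - cy) for cx, cy in criticalpoint_on_section)
         for x in range(width)]
        for y in range(height)
    ]
    lo = min(min(row) for row in sums)
    hi = max(max(row) for row in sums)
    span = hi - lo
    if span == 0.0:
        return [[1.0] * width for _ in range(height)]
    # Assumed mapping: a small distance sum means high significance.
    return [[(hi - s) / span for s in row] for row in sums]
```

A single focus in the middle of a 5x5 section yields significance 1 at the focus and 0 at the four corners, with a smooth falloff in between.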



3.2 User Specification 

Sometimes, the users may be interested in some area that does not contain any 
significant structures of a flow field in terms of the field topology analysis. Therefore, 
user intervention is also necessary for significance specification. The users can 
explicitly specify regions of higher or lower significance with a significance brush to 
define a significance map. User-defined significance may also be combined with the 
topology-based significance to take advantage of both specifications. 



4 Texture Image Generation Based on Significance Map 

4.1 Accelerating Texture Image Generation 

From the significance map, we can get the significance value at each cell in the 
texture space. Then, we can shorten the convolution length in the areas with lower 
significance values. Our approach can even accelerate PLIC, presented by Verma, Kao and 
Pang [10], which takes textures directly from a template instead of performing convolution. 
In the PLIC method, most of the time is spent on tracing a streamline at each seed 
pixel. Therefore, we can accelerate it by tracing shorter streamlines in the areas with 
lower significance values. Also, we can control the positions of the seed points and the 
thickness of the streamlines according to the significance map. Meanwhile, in the PLIC 
method, many streamlines intersect at the same pixel when the streamlines are long and 
dense, so in insignificant areas we can let each pixel receive a contribution from a single 
streamline to get an inexact texture at a lower computational cost. Since the number 
of critical points is usually very limited, our method can clearly accelerate the 
texture image generation with little loss of quality. If a flow field is very complex and 
has important details almost everywhere, the method offers no benefit. 



4.2 Highlighting Significant Structures and Details 

By the significance map, we can highlight significant structures and details. The input 
noise texture can be pre-multiplied by a function of the significance values to 
concentrate the highest input texture opacities in significant regions, and to reduce the 
input noise texture opacities to values approaching zero where the area is relatively 
insignificant. 
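
A minimal sketch of this pre-multiplication is given below. The text only says "a function of the significance values"; the power function with exponent `gamma` used here is an assumed example of such a function, and `weight_noise` is a hypothetical name. Significance values are assumed to lie in [0, 1].

```python
def weight_noise(noise, significance, gamma=2.0):
    """Scale each input-noise opacity by significance ** gamma (assumed weighting)."""
    return [
        [n * (s ** gamma) for n, s in zip(noise_row, sig_row)]
        for noise_row, sig_row in zip(noise, significance)
    ]
```

With this choice, noise in fully significant cells is left unchanged, noise in cells of significance 0 is removed, and a larger `gamma` suppresses the intermediate regions more strongly.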



4.3 Generating Multi-Granularity Texture Images 

We can select different granularities of texture according to the significance map of a 
flow field. A coarser granularity is selected in the lower significance area, and a finer 
granularity is selected in the higher significance area. Since we can involve a smaller 
number of cells for the computation in the area with coarse granularity, this method 
also can accelerate the LIC image generation. 



5 Implementation and Results 

The present method is easy to implement in a parallel environment because 
the feature analysis works on a cell-by-cell basis. The image is subdivided into several 
subdomains that are assigned to different processing elements (PEs). First, each PE finds 
the critical points in its subdomain, classifies them, and gets the coordinates of the focus and concentric 
critical points. Then, all the slave PEs send their results to the master PE. The master 
PE collects all the focus and concentric critical points, and sends them back to each 
slave PE. Next, each PE calculates the significance map in its subdomain according to 
all focus and concentric critical points. Finally, each PE determines pixel intensities. 
For the pixels near the boundary of their subdomains, the PEs need to access vector 
field data from neighboring subdomains. To minimize the communication, we 
replicate some boundary vector data. 

We have applied the method to a 3D irregular flow field, which simulates a flow 
passing through an ellipsoidal cylinder. The 3D data set has 25,790 grid cells. Fig. 
3(a) is a vertical cross-section texture image with a resolution of 450×300, which 
was generated by the original LIC method. The cross-section has three critical points 
in terms of the field topology analysis. We marked them in Fig. 3(b): one 
is a saddle critical point located on the horizontal centerline of the image, and the other 
two are attracting focus critical points, symmetrically located on the two sides of 
the horizontal centerline. Because saddle points have no focus, we have not 
considered them as significant critical points. In Fig. 3(c), we mapped the significance 
value at each pixel to a gray value. A darker point corresponds to a lower significance 
value. Fig. 3(d) was generated by our significance-driven LIC method. We adopted 
very short convolution lengths in the regions with low significance values. Image generation 
was 5 times faster than with the original LIC method. Although the image is a little 
blurred in the areas far away from the vortices, we keep almost the same quality of 
visualization. Meanwhile, the important structures of the flow field highlighted with 
the significance map attract much of the users' attention. Fig. 3(e) is the result of using 
different texture granularities according to the significance map. We can see that 
almost no detailed structures in the vector field were lost although we use coarser 
texture granularities in some insignificant regions. We generated this image almost 4 
times faster than with the original method. 



6 Conclusions and Future Work 

In this paper, we have presented a fast LIC image generation method using the 
significance map. We employed the topology analysis technique for vector fields to 
generate the significance map. The significance map can also be combined with users’ 
specification. By using the significance map as the basic reference structure, we 
developed the techniques to accelerate the texture image generation, to highlight the 
significant structures, and to generate a texture image with different texture 
granularities. Experimental results demonstrated the feasibility and effectiveness of our 
method. 

The LIC method often fails for 3D volumes because of its dense 3D texture. In the future, we 
will use the significance map to control the transfer functions for comprehensible 
volume rendering, highlighting the significant regions and neglecting insignificant 
information. 

Fig. 3(a). A cross-section texture image generated by the original LIC method 

Fig. 3(b). Three critical points on the cross-section: one is a saddle critical point, and 
the other two are attracting focus critical points 

Fig. 3(c). A gray-value coding of the significance map 

Fig. 3(d). A texture image generated by our significance-driven LIC method, which not only 
accelerates the image generation, but also highlights the vortices 

Fig. 3(e). A texture image generated with different texture granularities 



Abstract. Isosurface generation algorithms usually need a vertex- 
identification process since most polygon-vertices of an isosurface 
are shared by several polygons. In our observation the identification 
process is often costly when traditional search algorithms are used. In 
this paper we propose a new isosurface generation algorithm that does 
not use the traditional search algorithm for polygon-vertex 
identification. When our algorithm constructs a polygon of an 
isosurface, it visits all cells adjacent to the vertices of the polygon, and 
registers the vertices to polygons inside the visited adjacent cells. The 
method does not require a costly vertex identification process, since a 
vertex is registered at the same time in all polygons that share it, 
and the vertex is not needed after that moment. In experimental 
tests, this method was about 20 percent faster than the conventional 
isosurface propagation method. 



1 Introduction 

Isosurface generation is one of the most effective techniques for extracting features of 
a scalar field in volume data, such as the results of numerical simulation or medical 
measurement. Discussion of efficient isosurfacing methods has therefore been very 
active. Many approaches have been reported for the acceleration of isosurface 
generation, such as parallelization [1], graphics acceleration by generating triangular 
strips [2], and geometric approximation [3]. The most popular approach is to skip 
non-isosurface cells. Many reported algorithms sort or classify cells according to their 
scalar values [4,5,6,7]. Other algorithms that use spatial subdivision 
have also been proposed [8,9]. The present authors have proposed extrema-based 
algorithms [10,11] that efficiently search for isosurface cells. Starting from the 
extracted isosurface cells, an isosurface is generated by recursively traversing 
adjacent cells [12]. 

The above-mentioned algorithms have drastically reduced the unnecessary cost of 
visiting non-isosurface cells. Table 1 shows the cost of generating 20 isosurfaces with 
different iso-values in an unstructured volume consisting of tetrahedral cells. The 
experimental test compares a straightforward algorithm (ST) that visits all cells and 
the volume thinning algorithm (VT) [11]. Here, 

- N_c and N_n denote the numbers of cells and nodes in a volume. 

- N_t and N_v denote the total numbers of triangular polygons and vertices in the 20 
isosurfaces. 

- T_1 denotes the computational time of visiting non-isosurface cells. 

- T_2 denotes the computational time of visiting isosurface cells and constructing the 
topology of polygons. 

- T_3 denotes the computational time of calculating polygon-vertex data, such as 
positions and normal vectors. 

- T_total denotes the total time of generating isosurfaces. 



Table 1. Computational times of processes in generating isosurfaces 

Dataset            1         1         2         2 
Method             SF        VT        SF        VT 
N_c                61680     61680     346644    346644 
N_n                11624     11624     62107     62107 
N_t                80995     80995     135358    135358 
N_v                43158     43158     71358     71358 
T_1 (sec.)         8.30      0.09      42.23     0.57 
T_2 (sec.)         3.80      3.45      6.25      5.88 
T_3 (sec.)         0.76      0.75      1.14      1.15 
T_total (sec.)     12.86     4.29      49.62     7.60 



The results indicate that the above-mentioned acceleration algorithm achieved a great 
reduction of T_1. In other words, other approaches that reduce T_2 or T_3 are needed to 
develop more efficient isosurfacing algorithms. 

In our observations, the polygon-vertex identification process occupies the largest 
part of the computational time in constructing the topology of polygons. Though it 
would be possible to implement the isosurfacing algorithm without the polygon-vertex 
identification process, the process is desirable, because it reduces the amount 
of polygon-vertex data calculation and the memory space. In Table 1, the number of 
vertices would be 3N_t (about six times greater) without the identification 
process. In this case, the computational time T_3 would be greater than the polygon 
construction time T_2, and the memory space would be about three times greater. 
Moreover, the identification process is necessary if isosurfaces are used for 
applications that require the topology of polygons, such as parametric surface 
reconstruction, mesh compression, or mesh simplification. 
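
The "about six times" figure can be checked from Table 1 with a few lines of arithmetic. The sketch below assumes that, without identification, every triangle stores its own three vertices, and that the vertex-data time T_3 grows linearly with the number of vertices; the second assumption is ours and is used only to make the comparison with T_2 concrete. The values are copied from the VT columns of Table 1.

```python
# Values copied from the VT columns of Table 1.
table1 = {
    "dataset 1": {"N_t": 80995, "N_v": 43158, "T_2": 3.45, "T_3": 0.75},
    "dataset 2": {"N_t": 135358, "N_v": 71358, "T_2": 5.88, "T_3": 1.15},
}


def without_identification(row):
    """Estimate vertex count and T_3 if shared vertices were never merged."""
    unshared = 3 * row["N_t"]          # each triangle keeps 3 private vertices
    ratio = unshared / row["N_v"]      # growth factor in vertex count
    t3_estimate = row["T_3"] * ratio   # assumption: T_3 is linear in vertex count
    return unshared, ratio, t3_estimate


for name, row in table1.items():
    unshared, ratio, t3 = without_identification(row)
    print(f"{name}: {unshared} vertices ({ratio:.2f}x), "
          f"estimated T_3 {t3:.2f} s vs T_2 {row['T_2']:.2f} s")
```

Under these assumptions the estimated T_3 exceeds T_2 for both datasets, which agrees with the statement in the text.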

An example of an implementation of the vertex identification process is described in 
Doi and Koide [13]. Their implementation uses a hash-table to search for shared vertices; 
however, in our observation, such a traditional search algorithm occupies a large part 
of the computation time. The isosurfacing process would be accelerated if the 
vertex-identification process could be implemented without the costly search algorithm. 

Fast Isosurface Generation Using the Cell-Edge Centered Propagation Algorithm 549 


In this paper we propose an isosurface propagation algorithm that efficiently 
identifies shared polygon-vertices. When our algorithm constructs a polygon of an 
isosurface, it visits all cells adjacent to the vertices of the polygon, and registers the 
vertices to polygons inside the visited adjacent cells. The method does not require a 
costly vertex identification process, since a vertex is registered at the same time in all 
polygons that share it, and the vertex is not needed after that moment. 



2 Related Work 

2.1 Polygon-Vertex Identification 

When polygons in an isosurface are generated by the conventional Marching Cubes 
method [14], all their vertices lie on cell-edges, and most are shared by several 
polygons. If a volume data structure contains all cell-edge data, the shared vertices 
can be extracted immediately. However, cell-edge data is not usually preserved in a 
volume data structure, owing to the limited memory space. 

Fig. 1. Polygon-vertex identification using a hash-table (stages shown: polygon and vertex 
generation, vertex registration, vertex extraction) 

An example of a vertex identification process is described in Doi and Koide [13]. 
Given an iso-value C of an isosurface, the implementation first determines the sign 
of S(x,y,z) - C at the nodes of a cell, where S(x,y,z) denotes the scalar value at a node. 
If all the signs are equal, the cell is not an isosurface cell and the calculation of the 
action value is skipped. Otherwise, the process extracts isosurface cell-edges of a cell. 
Here, a cell-edge is represented as a pair of nodes. When a polygon-vertex is 
generated on a cell-edge, it is registered to the hash-table with the pair of nodes that 
denotes the cell-edge. When another isosurface cell that shares the same cell-edge is 
visited, the polygon-vertex is extracted from the hash-table by inputting the pair of 
nodes. In the implementation, all isosurface cell-edges are registered to the hash-table 
with the polygon-vertices of an isosurface. 

Fig. 1 shows an example of this process. When polygon P_1 is first constructed, 
vertices V_a, V_b, V_c, and V_d are registered in a hash-table with pairs of nodes. For 
example, vertex V_b is registered with a pair of nodes, n_1 and n_2, that denotes the 
cell-edge that V_b lies on. When polygon P_2 is then constructed, vertex V_b is extracted 
from the hash-table by inputting the pair of nodes n_1 and n_2. 
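
The hash-table approach described above can be sketched in a few lines. This is not the code of [13]; it only illustrates the idea of keying each polygon-vertex by the pair of nodes of its cell-edge. The function name and the input format (each polygon given as a list of node pairs, one per intersected cell-edge) are assumptions of this example, and a Python dictionary plays the role of the hash-table.

```python
def identify_vertices(polygons):
    """Merge polygon-vertices that lie on the same cell-edge.

    polygons: list of polygons, each a list of (node_a, node_b) cell-edges.
    Returns (edges_of_vertices, indexed_polygons).
    """
    table = {}            # (smaller node, larger node) -> vertex index
    vertices = []         # vertex index -> cell-edge the vertex lies on
    indexed = []
    for polygon in polygons:
        ids = []
        for a, b in polygon:
            key = (a, b) if a < b else (b, a)   # edge direction must not matter
            if key not in table:                # first visit: register the vertex
                table[key] = len(vertices)
                vertices.append(key)
            ids.append(table[key])              # later visits: extract it
        indexed.append(ids)
    return vertices, indexed
```

Every polygon-vertex costs one table lookup, and this per-vertex search is the cost that the method proposed in this paper aims to avoid.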



2.2 Isosurface Propagation 

An isosurface is efficiently generated by recursively visiting adjacent isosurface cells. 
Such recursive polygonization algorithms were originally proposed for efficient 
polygonization of implicit functions [15,16], and have since been applied to volume 
datasets [12]. 

In a typical isosurface propagation algorithm, isosurface cells are extracted by a 
breadth-first traversal. In the algorithm, several isosurface cells are first inserted into a 
FIFO queue. They are then extracted from the FIFO, and polygons are generated 
inside them. Isosurface cells adjacent to the extracted cells are then also inserted into 
the FIFO. This process is repeated until the FIFO queue becomes empty, and finally 
the isosurface is constructed. 




Fig. 2 shows an example of a typical isosurface propagation algorithm. When 
polygon P_1 is first constructed, four adjacent isosurface cells, C_2, C_3, C_4, and C_5, 
are inserted into the FIFO. When these cells are extracted from the FIFO, four 
polygons, P_2, P_3, P_4, and P_5, are constructed. When P_2 is constructed, adjacent 
isosurface cells, C_6, C_7, and C_8, are similarly inserted into the FIFO. Polygons P_6, 
P_7, and P_8 are then similarly constructed when C_6, C_7, and C_8 are extracted from 
the FIFO. 
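
The breadth-first propagation can be sketched as follows. The mesh is abstracted away: `neighbors` and `is_isosurface_cell` are hypothetical callbacks that a real implementation would back with the volume's cell connectivity and scalar values, and polygon generation is reduced to recording the order in which cells are processed.

```python
from collections import deque


def propagate(seed_cells, neighbors, is_isosurface_cell):
    """Visit isosurface cells breadth-first, starting from the seed cells."""
    fifo = deque()
    inserted = set()
    for cell in seed_cells:
        if cell not in inserted and is_isosurface_cell(cell):
            inserted.add(cell)
            fifo.append(cell)
    order = []
    while fifo:
        cell = fifo.popleft()
        order.append(cell)            # a polygon would be generated inside the cell here
        for adjacent in neighbors(cell):
            if adjacent not in inserted and is_isosurface_cell(adjacent):
                inserted.add(adjacent)
                fifo.append(adjacent)
    return order
```

Using the adjacency of the Fig. 2 example (cell 1 adjacent to cells 2 to 5, cell 2 adjacent to cells 6 to 8), the cells are processed in the order 1, 2, ..., 8, which matches the order in which the polygons are constructed in the text.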

The propagation algorithm has the great advantage of reducing the number of 
non-intersecting cells visited. However, it also has the problem that the starting 
isosurface cells must first be specified. Efficient automatic extraction of the starting 
cells was previously difficult, especially when the isosurface was separated into many 
disconnected parts. 

The authors have proposed a method for automatically extracting isosurface cells in 
all disconnected parts of an isosurface [10,11]. The method first extracts extremum 
points of a volume, and then generates a skeleton connecting all extremum points. 
The skeleton consists of cells, and every isosurface intersects at least one cell in the 
skeleton. The method efficiently generates isosurfaces by searching for isosurface 
cells in the skeleton and then applying the isosurface propagation algorithm [12]. 

Our method [10,11] requires less than O(n) computational time for the isosurfacing 

process, since the cost of searching for isosurface cells is regarded as on 

average, unless the number of extremum points is enormous. The computational time 
of pre-processing in the volume thinning method [11] is always regarded as O(n). 

Note that the vertex-identification process in the isosurface propagation 
algorithm still needs a vertex search algorithm. For example, the polygon-vertices of P_1, 
V_a, V_b, V_c, and V_d, are registered into a hash-table when P_1 is generated. The 
polygon-vertex is then extracted from the hash-table when P2 , P3 , and Pg are 
generated. 



3 Cell-Edge Centered Isosurface Propagation 

3.1 Algorithm Overview 

In this paper we propose an isosurface generation algorithm that does not need a 
search algorithm in its vertex identification process. Fig. 3 shows the overview of the 
new method. 

The method assumes that at least one isosurface cell is given. It first generates a 
polygon P_1 inside the given cell, and allocates its polygon-vertices, V_a, V_b, V_c, and 
V_d. It then visits all cells that are adjacent to the polygon-vertices of P_1. In Fig. 3, cells 
that are adjacent to V_b are visited, and polygons P_2, P_3, and P_4 are generated. 
Note that the polygon-vertices of the three new polygons are not allocated at that time. 
It then assigns V_b to the three polygons. V_b is no longer required in this algorithm, 
because all polygons that share V_b have been generated at that time. This means that a 
search algorithm is not necessary for the vertex identification in the method. 
Similarly, in Fig. 3, cells that are adjacent to the next polygon-vertex are then visited. 
Polygons P_5 and P_6 are generated at that time, and that vertex is assigned to them. 



3.2 Combination with the Volume Thinning Method 

The new method assumes that at least one isosurface cell is given, so the method 
should be combined with an isosurface cell extraction method. We applied the volume 
thinning method [11] in order to extract isosurface cells in all disconnected parts of an 
isosurface. 

The volume thinning method first generates an extrema skeleton, which consists of cells, 
in a pre-processing step. The extrema skeleton has the property that every isosurface 
intersects it, so isosurface cells can always be extracted by traversing the extrema skeleton. 

Fig. 4 shows the pseudo-code of our implementation. The implementation first 
extracts isosurface cells from the extrema skeleton, and inserts them into a FIFO 
queue. It then extracts an isosurface cell C_i from the FIFO, and constructs the 
polygon P_i inside C_i. At that moment, though the number of polygon-vertices of P_i is 
specified, no polygon-vertex has been allocated. The implementation then extracts the 
isosurface cell-edges of C_i. If a polygon-vertex is not allocated on an isosurface 
cell-edge E_n at that time, the implementation allocates a polygon-vertex V_n, registers it 
to the polygon P_i, and visits all cells that share the cell-edge E_n by using the 
connectivity of cells. If a polygon is not constructed in the visited cell C_j, the 
implementation constructs a polygon P_j in C_j. The polygon-vertex V_n on E_n is registered 
into the polygon P_j. If the visited cell C_j has not been inserted into the FIFO, the 
implementation also inserts C_j into the FIFO at that time. The above process repeats 
until the FIFO becomes empty. 

In this algorithm, most isosurface cells are visited several times by the cell-edge-centered 
process (the for-loop (3) in Fig. 4), and a polygon is constructed at the first 
visit. The cells are also visited when they are extracted from the FIFO (the for-loop 
(1) in Fig. 4), and all polygon-vertices of the polygons inside the extracted cells are 
set at that moment. The method processes a cell several times; however, our 
experimental tests show that its computational time is less than that of the conventional 
methods. 



void Isosurfacing() { 
    for (each cell C_i in an extrema skeleton) { 
        if (C_i is an isosurface cell) { insert C_i into FIFO; } 
    } 
    /* for-loop (1) */ 
    for (each cell C_i extracted from FIFO) { 
        if (polygon P_i in C_i is not constructed) { Construct P_i in C_i; } 
        /* for-loop (2) */ 
        for (each intersected edge E_n) { 
            if (a polygon-vertex V_n on E_n is not added into P_i) { 
                Allocate V_n on E_n; 
                Register V_n into P_i in C_i; 
                /* for-loop (3) */ 
                for (each cell C_j which shares E_n) { 
                    if (P_j in C_j is not constructed) { Construct P_j in C_j; } 
                    Register V_n into P_j in C_j; 
                    if (C_j has never been inserted into FIFO) { insert C_j into FIFO; } 
                } /* end for-loop (3) */ 
            } /* end if (there is not V_n) */ 
        } /* end for-loop (2) */ 
    } /* end for-loop (1) */ 
    for (each polygon-vertex V_n) { Calculate position and normal vector; } 
} 



Fig. 4. Algorithm of the cell-edge-centered isosurface propagation method 
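
The propagation in Fig. 4 can be sketched in Python as follows. This is only an 
illustrative sketch under stated assumptions, not the authors' implementation: the 
function name, the dictionary-based cell connectivity, and the data layout (tuples of 
node ids, lists of coordinates and scalars) are all hypothetical. Seed cells are 
assumed to come from some extraction step such as the extrema skeleton, normal 
vectors are omitted, and the degenerate case of a node value exactly equal to the 
isovalue is ignored. 

```python
from collections import deque
from itertools import combinations


def propagate_isosurface(cells, coords, scalars, iso, seeds):
    """Cell-edge-centered propagation over tetrahedral cells (illustrative sketch).

    cells   : list of 4-tuples of node ids
    coords  : list of (x, y, z) node positions
    scalars : list of node scalar values
    seeds   : indices of known isosurface cells
    """
    # Connectivity of cells: every cell-edge maps to the cells that share it.
    edge_cells = {}
    for ci, nodes in enumerate(cells):
        for edge in combinations(sorted(nodes), 2):
            edge_cells.setdefault(edge, []).append(ci)

    def is_intersected(edge):
        a, b = edge
        return (scalars[a] - iso) * (scalars[b] - iso) < 0

    polygons = {}        # cell id -> {cell-edge: polygon-vertex id}
    vertex_edges = []    # polygon-vertex id -> cell-edge it lies on
    fifo = deque(seeds)
    queued = set(seeds)

    while fifo:                                            # for-loop (1)
        ci = fifo.popleft()
        polygon = polygons.setdefault(ci, {})
        for edge in combinations(sorted(cells[ci]), 2):    # for-loop (2)
            if not is_intersected(edge) or edge in polygon:
                continue
            vid = len(vertex_edges)                        # allocate V_n on E_n
            vertex_edges.append(edge)
            # for-loop (3): the sharing cells include C_i itself, so this
            # also registers V_n into P_i.
            for cj in edge_cells[edge]:
                polygons.setdefault(cj, {})[edge] = vid    # register V_n into P_j
                if cj not in queued:
                    queued.add(cj)
                    fifo.append(cj)

    # Positions are computed once per polygon-vertex, after propagation.
    positions = []
    for a, b in vertex_edges:
        t = (iso - scalars[a]) / (scalars[b] - scalars[a])
        positions.append(tuple(pa + t * (pb - pa)
                               for pa, pb in zip(coords[a], coords[b])))
    return polygons, positions
```

Because a vertex is handed to every cell sharing its edge at the moment it is 
allocated, no later lookup by vertex position or edge key is needed to identify 
shared vertices; the per-cell check `edge in polygon` only asks whether this cell 
has already received a vertex on that edge. 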
4 Experimental Results 

This section compares the experimental results given by the cell-edge centered 
propagation method with those given by the conventional propagation method. The 
experiments were carried out on an IBM PowerStation RS/6000 (Model 560). Four 
datasets for unstructured volumes consisting of tetrahedral cells, which contain the 
results of numerical simulations, were used for the experiments. 

Table 2 shows the results of experiments in which a series of 20 isosurfaces were 
generated for each volume, with various scalar values. Here, 

— N_c and N_n denote the numbers of cells and nodes in a volume. 

— N_t and N_v denote the total numbers of triangular polygons and vertices in the 20 
isosurfaces. 

— T_1 denotes the computational time of generating 20 isosurfaces by the 
conventional propagation method. 

— T_2 denotes the computational time of generating 20 isosurfaces by the cell-edge 
centered propagation method. 

— T_P1 and T_P2 denote the computational times of the polygon construction 
processes of the two propagation methods. 

In these experiments, the volume thinning method [11] extracts the starting cells of 
the propagation. 



Table 2. Computational times of processes in generating isosurfaces 



Dataset              1         2         3          4 
N_c              61680    346644    458664     557868 
N_n              11624     62107     80468      97943 
N_t              80995    135398    494480    1164616 
N_v              43158     71358    251506     588796 
T_P1 (sec.)       3.45      5.88     21.35      49.80 
T_P2 (sec.)       2.53      4.01     15.72      36.20 
T_1 (sec.)        4.29      7.60     26.65      60.81 
T_2 (sec.)        3.35      5.52     20.61      46.80 



The results show that the polygon construction process in the cell-edge centered 
propagation method is about 25 percent faster than the conventional propagation 
method, and the total isosurfacing process is about 20 percent faster. 
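
The percentages quoted above can be checked directly against the timings in 
Table 2. The short sketch below is an illustration only: it interprets "faster" as 
the relative reduction in computational time, which is an assumption about how the 
rounded figures were obtained, and the variable and function names are hypothetical. 

```python
# Timings in seconds, copied from Table 2 (datasets 1 to 4).
t_p1 = [3.45, 5.88, 21.35, 49.80]   # polygon construction, conventional method
t_p2 = [2.53, 4.01, 15.72, 36.20]   # polygon construction, cell-edge centered method
t_1 = [4.29, 7.60, 26.65, 60.81]    # total time, conventional method
t_2 = [3.35, 5.52, 20.61, 46.80]    # total time, cell-edge centered method


def reductions(old, new):
    """Relative reduction in time, 1 - new/old, for each dataset."""
    return [1.0 - n / o for o, n in zip(old, new)]


polygon_reduction = reductions(t_p1, t_p2)
total_reduction = reductions(t_1, t_2)

for i, (p, t) in enumerate(zip(polygon_reduction, total_reduction), start=1):
    print(f"dataset {i}: polygon construction {p:.1%}, total {t:.1%}")
```

Under this interpretation the per-dataset reductions come out somewhat above the 
rounded figures in the text, so the quoted 25 and 20 percent appear to be 
conservative summaries. 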



5 Conclusion 



In this paper we proposed an isosurface generation algorithm that does not use a 
vertex search algorithm for the vertex-identification process. The algorithm visits all 
cells sharing an isosurface cell-edge at the same time, and the vertex that lies on the 
cell-edge is registered to all the polygons inside the visited cells. The vertex is not 
required again later in the process, and a vertex search algorithm is therefore 
unnecessary in our method. Our experimental tests showed that the method is about 20 
percent faster than the conventional implementation. 

In the future, we would like to implement this method for hexahedral cells and to 
measure the computational time of the isosurfacing processes. 
Abstract. This paper proposes a fast cell traverse method for volume 
rendering of irregular volume datasets. All cells of an irregular volume are 
subdivided into a set of tetrahedral cells for our algorithm. The number 
of calculations required to find the intersections of a ray and irregular 
volume is reduced by using the exterior faces of cells rather than the 
ray as a basis for processing. An efficient new method of computing the 
integration of the brightness equation along a ray takes advantage of 
the linear distribution of data within a tetrahedral cell. Benchmark tests 
proved that the proposed method significantly improves the performance 
of volume rendering. 



1 Introduction 

Volume rendering depicts a given volume dataset as semi-transparent density 
clouds, whose appearance can be easily modified by specifying a transfer 
function for mapping scalar data to color (brightness) and opacity (light 
attenuation). The specification is performed by the user so that data values are 
related to meaningful colors, the part of the volume data of most interest to the 
user is exposed, and the part of no interest is transparent. Images are formed 
from the resulting colored semi-transparent volume by blending together volume 
cells projected onto the same pixel on the picture plane. This projection can be 
performed in either image order or object order. 

The object order approach has in many ways been preferable to the image 
order approach. It can take advantage of coherence when a voxel projects 
onto many neighboring pixels. Methods have been developed for rendering 
the projection of volume cells by using Gouraud-shaded and partially 
transparent polygons [Wilh91,Shir90,Will92a,Laur91,Hanr90,Schr91,Koya92b]. 
However, when we deal with a very large volume dataset, the projected volume 
cell may be smaller than the pixel size. In that case, since the average number 
of viewing rays that intersect a single cell is small, we can expect the image 
order approach to give better performance than the object order approach. 

The image order approach is generally called “ray-casting,” because it scans 
the display screen and, by casting a viewing ray, determines for each pixel what 
volume cells affect it. The opacities and shaded colors encountered along the ray 
are summed to find the opacity and color of the pixel. Whereas in ray-tracing 
viewing rays bounce off when they hit reflective objects, in ray-casting they 
continue in a straight line until the opacity encountered by the ray sums to 
unity or the ray reaches the exterior of the volume data. No shadows or 
reflections are generated. Descriptions of ray-casting approaches appear in 
various studies [Levo88,Upso88,Sabe88,Dreb88]. 

Initially, attention in volume rendering was focused on medical imaging. 
This promoted the development of techniques for regular volume datasets. 
Currently, PC-based volume rendering hardware is available that can render a 
128^3 voxel dataset at a rate of thirty frames per second [Pfis99]. Recently, 
efforts have been made to support the rendition of data stored in curvilinear 
volume datasets [Wilh90a,Hong99] and in irregular volume datasets 
[Koya90,Garr90]. These techniques can be used to convert a non-regular volume 
dataset into a regular volume dataset. However, a ray-casting approach generally 
takes a lot of computational time. Its most computationally intensive portions 
are testing for intersections of a ray and a non-regular volume, and integration 
of brightness along rays. 

In this paper, we propose a new cell traverse method to eliminate these 
two bottlenecks in computation. It reduces the number of intersection tests by 
using image coherence, and interpolates data efficiently along a ray by taking 
advantage of the characteristics of tetrahedral cells. 

2 Overview of the Image Order Approach 

To render a given volume dataset, it is necessary to calculate Eq. 1 for each 
viewing ray that passes through an eye position and a pixel position on an 
image screen. 



Here, B represents the total intensity along the ray, C_i is the luminosity, and 
α_i is the opacity in the i-th subdomain along the ray. 

In the image order approach, for each viewing ray, a scalar value s_i and 
a scalar gradient ∇s_i are computed at the center of basically equal-sized 
volume sample segments along the ray in front-to-back order, by interpolating from 
the scalar values and scalar gradients at the node points of the tetrahedral cell 
that includes the sampling location. The pseudo-color and opacity α_i at the 
sampling location are determined by reference to pre-defined color and opacity 
lookup tables from the sampled scalar value s_i. A shaded color C_i is then 
computed by shading the pseudo-color, using some shading model such as Phong’s 
model [Phon75], where the surface normal is the normalized scalar gradient at 
the location. An implementation example is as follows: 

— Cast a ray, which is equivalent to a pixel position, into a volume dataset. 

— Interpolate a scalar value and a scalar gradient at a sampling point along 
the ray. 

— Calculate the luminosity and opacity by using transfer functions and some 
shading model. 

— Composite the color values by using the opacity for the final pixel values. 

The corresponding algorithm is as follows: 

For a pixel position (p) 
    For a sampling point (m) 
        1. s_m^p ← interpolate-in-cell(s^{node1}, s^{node2}, s^{node3}, s^{node4}) 
        2. ∇s_m^p ← interpolate-in-cell(∇s^{node1}, ∇s^{node2}, ∇s^{node3}, ∇s^{node4}) 
        3. α_m^p ← opacity(s_m^p) 
        4. C_m^p ← luminosity(color(s_m^p), ∇s_m^p) 



— is the brightness resulting from integrating Eq. 1 from the nth sampling 
point to the mth sampling point (n > m): 



The final brightness is = B^. 

— s_m^p, ∇s_m^p, α_m^p, and C_m^p denote a scalar value, a scalar gradient, an 
opacity, and a luminosity at the mth sampling point along the pth viewing ray 
(pixel position). 

— interpolate-in-cell(,,,) is an operator that interpolates a value from the values 
at four node points (node1, node2, node3, and node4) that define a cell (in 
this case, a tetrahedral cell). 

— opacity() is a function for transferring a scalar value to an opacity value. 

— color() is a function for transferring a scalar value to its pseudo-color, which 
generally has three components of red, green, and blue. 

— luminosity(,) means a shading model. For simplicity, we assume that the 
model is a function of pseudo-colors and a scalar gradient. 
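
The per-sample steps listed above can be illustrated with a small Python sketch. 
Everything concrete in it is an assumption made for illustration: the transfer 
functions are simple clamps, the pseudo-color is reduced to a single grey level, 
the shading is a diffuse term with a fixed hypothetical light direction, and the 
compositing uses the standard front-to-back accumulation, which may differ in 
detail from the brightness equation used in this paper. 

```python
import math

LIGHT = (0.0, 0.0, 1.0)  # hypothetical light direction (unit vector)


def interpolate_in_cell(weights, node_values):
    """Weighted sum of the four node values (scalars or 3-vectors)."""
    if isinstance(node_values[0], (int, float)):
        return sum(w * v for w, v in zip(weights, node_values))
    return tuple(sum(w * v[k] for w, v in zip(weights, node_values))
                 for k in range(3))


def opacity(s):
    """Hypothetical opacity transfer function: clamp the scalar to [0, 1]."""
    return min(max(s, 0.0), 1.0)


def color(s):
    """Hypothetical pseudo-color: a single grey level instead of RGB."""
    return min(max(s, 0.0), 1.0)


def luminosity(c, gradient):
    """Diffuse shading with the normalized scalar gradient as surface normal."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm == 0.0:
        return c  # no usable normal: leave the color unshaded
    diffuse = max(0.0, sum(g * l for g, l in zip(gradient, LIGHT)) / norm)
    return c * diffuse


def cast_ray(samples):
    """Composite the samples of one viewing ray in front-to-back order.

    samples: list of (weights, node_scalars, node_gradients), one per sampling point.
    """
    brightness = 0.0
    transparency = 1.0
    for weights, node_scalars, node_gradients in samples:
        s = interpolate_in_cell(weights, node_scalars)        # step 1
        grad = interpolate_in_cell(weights, node_gradients)   # step 2
        alpha = opacity(s)                                    # step 3
        c = luminosity(color(s), grad)                        # step 4
        brightness += transparency * alpha * c                # assumed compositing
        transparency *= 1.0 - alpha
        if transparency <= 0.0:   # accumulated opacity has reached unity
            break
    return brightness
```

The early exit mirrors the description of ray-casting in the introduction: a ray 
stops contributing once the accumulated opacity reaches unity. 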

One major advantage of the image order approach is that geometric data, such 
as polygonally defined objects, can be easily integrated into the above procedure. 
An integrated rendering algorithm will be given in the next section. On the other 
hand, a well-known disadvantage of this approach is that the computation cost 
is huge because it does not take advantage of coherence within the volume data 
sets. In the following sections, we will describe two techniques for 

— reducing the cost of ray-face intersection tests and 

— reducing the cost of interpolation of the sampled values. 

3 Reducing the Cost of Ray-Face Intersection Tests 

3.1 Related Work 

In order to calculate the brightness of Eq. 1 for a given viewing ray, an exterior 
face that intersects the viewing ray is first searched for. Then, the cells along 
the ray are traversed until it exits from another exterior face. An additional 
calculation to check for reentry of the ray into the volume is required in handling 
a nonconvex mesh. 

The cost of ray-face intersection tests strongly affects the total computational 
expense of volume rendering. Rubin and Whitted reported that most 
of the total computation time of a ray-tracing program is spent on intersection 
tests [Rubi80]. Various algorithms for reducing the number of intersection 
tests have been developed. Some of them use simple bounding volumes: Rubin 
and Whitted replaced exhaustive searching for intersections by checking 
with simple bounding volumes [Rubi80], while Weghorst et al. examined how 
bounding volumes can be selected in such a way as to reduce the computational 
cost of the intersection tests [Wegh84]. Another type of improvement employs 
techniques for subdividing three-dimensional space. Glassner investigated the 
effects of partitioning the space with an octree data structure [Glas84]. Fujimoto 
et al. compared octrees with a rectangular linear grid. A shortcoming of this 
type of technique is that some rays must be cast that do not pass through any 
face [Fuji85]. 

The above algorithms take no account of the knowledge that neighboring 
rays are very likely to intersect the same exterior face. Obviously, a new tech- 
nique that exploits image coherence must be found in order to decrease the 
computational expense. Weghorst et al. created an item buffer to improve the 
first ray-intersection test [Wegh84]. This concept can be applied to the test for 
the intersection of a ray and the nearest (or furthest) exterior face of a tetra- 
hedral model. Moreover, it can be extended to an efficient ray-face intersection 
algorithm, if we assume that refraction, reflection, and shadow rays are not con- 
sidered in the ray tracing of a tetrahedral model. 

Scan-Conversion of Exterior Faces. One promising idea for such an intersection 
algorithm is to cast viewing rays into a volume dataset from each exterior 
face, which is processed either from back to front (BTF) or from front to back 
(FTB). In BTF, only back-facing exterior faces are processed. In FTB, only 
front-facing exterior faces are processed. Note that we process about half of the 
exterior faces in either case. Before casting rays, we need to calculate the priority 
of the back- or front-facing exterior faces by depth sorting. The intersection of 
a viewing ray and an exterior face can be easily determined by scan-converting 
the face on the screen. If the visualized volume is convex, the priority need not be 
calculated, because there is no overlapping of back- or front-facing exterior faces 
when they are scan-converted on the screen. In this case, exterior faces can be 
scan-converted in any order. The intersections can be incrementally calculated by 
using the digital differential analyzer (DDA) approach, because they exist within 
the face projection on the screen. If a vertex that is shared by triangles on the 
screen coincides with a pixel position, duplicate viewing rays may be cast. This 
leads to a visual artifact in the generated image. The coordinates of the vertex 
can be perturbed infinitesimally to eliminate such degenerate cases. 

We start the cell traverse from a position that is incrementally decided. Ref- 
erence to the cell adjacency component of the tetrahedral model enables us to 
search for cells that intersect the viewing ray successively until the ray reaches an 
exterior face. When it exits, the calculated brightness is stored at the correspond- 
ing pixel position of the frame buffer. If no other exterior face is scan-converted 
to this pixel, the value becomes a final one; otherwise, the result is used as an 
initial value for calculating the brightness. The whole volume, either convex or 
nonconvex, is finally traversed when all the exterior faces have been processed 
in order of priority, as shown in Figure 1. 




Fig. 1. Scan-conversion of exterior faces 
3.2 Reducing the Cost of Interpolation of the Sampled Values 

Interpolation in a Tetrahedral Cell. In order to calculate the luminosity 
and opacity at a sampling point rapidly, it is very important that data, such 
as scalar data and scalar gradients, should be efficiently interpolated along a 
viewing ray. In general, a scalar value S_X at a point X in a tetrahedral cell 
(ABCD) is interpolated as 

    S_X = N_A \times S_A + N_B \times S_B + N_C \times S_C + N_D \times S_D,    (3) 

where 

— N_A, N_B, N_C, and N_D are the interpolation functions of the tetrahedral cell 
at node points A, B, C, and D, respectively, and the equation 
N_A + N_B + N_C + N_D = 1 holds. 

— S_A, S_B, S_C, and S_D are the scalar values at A, B, C, and D, respectively. 

Obviously, it takes a lot of CPU time to interpolate scalar values at multiple 
sampling points in the cell by simply repeating this method. Our concern is with 
the interpolation at sampling points along a viewing ray. In this connection, we 
should remember that the data distribution in a tetrahedral cell is linear in any 
direction, as we have shown. Using this feature, we developed a new method 
of interpolation, which we call a linear sampling method. The procedure for 
interpolation has three steps. To illustrate this, we assume that a viewing ray 
enters a cell at point P in triangle (ABC). 
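
To make the cost of the direct approach concrete, the following Python sketch 
evaluates Eq. 3 at a single point. It is a minimal illustration, not the paper's 
code: the helper names are hypothetical, and the interpolation functions are 
obtained with Cramer's rule from the relation 
X - A = N_B (B - A) + N_C (C - A) + N_D (D - A), which is one standard way to 
compute them for a non-degenerate tetrahedron. 

```python
def sub(p, q):
    """Component-wise difference of two 3D points."""
    return tuple(pi - qi for pi, qi in zip(p, q))


def det3(a, b, c):
    """Determinant of the 3x3 matrix whose rows are a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def interpolation_functions(x, a, b, c, d):
    """Return (N_A, N_B, N_C, N_D) at point x of tetrahedron ABCD."""
    ab, ac, ad, ax = sub(b, a), sub(c, a), sub(d, a), sub(x, a)
    vol = det3(ab, ac, ad)          # six times the signed volume; must be non-zero
    n_b = det3(ax, ac, ad) / vol
    n_c = det3(ab, ax, ad) / vol
    n_d = det3(ab, ac, ax) / vol
    return (1.0 - n_b - n_c - n_d, n_b, n_c, n_d)


def interpolate_scalar(x, nodes, values):
    """Eq. 3: S_X as the weighted sum of the four node values."""
    weights = interpolation_functions(x, *nodes)
    return sum(n * s for n, s in zip(weights, values))
```

Each call evaluates four 3x3 determinants, which is the per-sample work that the 
linear sampling method avoids by interpolating only along the ray inside a cell. 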

Linear Sampling Method. First step: The first step is to determine the 
face through which the ray leaves a cell. In our approach, a ray is mapped to a 
point in a pixel plane, since we consider only a viewing ray, namely a ray from an 
eye position. Therefore, we simplify the problem to one of determining whether 
a point that represents the ray is included in the triangle on the screen. The 
triangle and the ray are expressed in a normalized projection coordinate system 
(NPC). We use an example to check whether the ray leaves the cell at the point 
Q in triangle CDA, as shown in Figure 2. 

Note that the check is performed in two-dimensional space. If the vectors 
AC and AD are parallel, the point is not included in it. The ray does not leave 
the cell from CDA, and another triangle of the cell is checked. If the vectors are 
not parallel, the vector AQ can be expressed as a linear combination of them: 

    \vec{AQ} = s_Q \times \vec{AC} + t_Q \times \vec{AD},    (4) 

where s_Q and t_Q are weighting values. The matrix system for (s_Q, t_Q) is 

    \begin{pmatrix} x_C - x_A & x_D - x_A \\ y_C - y_A & y_D - y_A \end{pmatrix} 
    \begin{pmatrix} s_Q \\ t_Q \end{pmatrix} 
    = \begin{pmatrix} x_Q - x_A \\ y_Q - y_A \end{pmatrix},    (5) 

where (x_C, y_C), (x_D, y_D), (x_A, y_A), and (x_Q, y_Q) are the coordinates of 
points C, D, A, and Q in the NPC, respectively. The system is solvable, because the 
matrix is regular. If s_Q and t_Q satisfy the following conditions: 

— s_Q > 0 

— t_Q > 0 

— s_Q + t_Q < 1, 

the point Q is inside the triangle CDA and checking need not be done at further 
candidate faces. Otherwise, the point is outside the triangle and another triangle 
is checked. Our approach checks two triangles on average. 

Fig. 2. First step of linear sampling method 
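
The first step can be written compactly as a two-dimensional test. The sketch 
below solves the 2x2 system of Eq. 5 by Cramer's rule and applies the three 
conditions on s_Q and t_Q. The function names and the use of plain coordinate 
tuples in the NPC are assumptions made for illustration. 

```python
def weights_in_triangle(q, a, c, d):
    """Solve Eq. 5 for (s_Q, t_Q); return None if AC and AD are parallel."""
    ux, uy = c[0] - a[0], c[1] - a[1]   # vector AC on the screen
    vx, vy = d[0] - a[0], d[1] - a[1]   # vector AD on the screen
    det = ux * vy - vx * uy
    if det == 0:
        return None                      # the projected triangle is degenerate
    qx, qy = q[0] - a[0], q[1] - a[1]   # vector AQ on the screen
    s_q = (qx * vy - vx * qy) / det
    t_q = (ux * qy - qx * uy) / det
    return s_q, t_q


def ray_leaves_through(q, a, c, d):
    """True if the point q representing the ray lies inside triangle CDA."""
    st = weights_in_triangle(q, a, c, d)
    if st is None:
        return False
    s_q, t_q = st
    return s_q > 0 and t_q > 0 and s_q + t_q < 1
```

The candidate exit faces of a cell would be tested one after another with 
`ray_leaves_through` until one returns True; the weights returned by 
`weights_in_triangle` for that face are the (s_Q, t_Q) reused by the approximate 
technique of the second step. 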

In contrast, the conventional approach first calculates the distances along 
the ray at which it intersects the planes of three candidate exit triangles and 
then selects as the exit face the triangle that has the minimum distance value. 
The triangle and the ray are expressed in a view reference coordinate system 
(VRC). This approach has the merit of being able to handle various other rays 
in addition to viewing rays. However, it always needs to check three triangles 
in three-dimensional space, which is computationally inefficient in comparison 
with our approach. 

Second step: The second step is to interpolate data (S_Q) at the intersection 
(point Q) of a cell and a viewing ray on the face determined in the first step, as 
shown in Figure 3. By replacing X with Q in Eq. 3, the data (S_Q) are interpolated 
as 

    S_Q = s \times S_C + t \times S_D + (1 - s - t) \times S_A,    (6) 

where s = N_C, t = N_D, and N_B = 0. For this step, we propose two interpolation 
techniques that differ according to the way in which the weighting values (s, t) 
are calculated: an accurate technique and an approximate technique. 

In the accurate technique, the weighting values can be obtained by solving 
a matrix system similar to Eq. 5, not in an NPC but in a VRC. All coordinates 
should be inversely transformed from an NPC into a VRC. Before the 
transformation, the z-coordinate of the point Q is calculated by using Eq. 6. Without 
losing generality, we can assume that the triangle CDA is not parallel to the 
z-axis, because a non-degenerate triangle cannot be parallel to all three orthogonal 
axes x, y, and z. This assumption means that the parallel projection of the triangle 
CDA onto the xy-plane is not a straight line. As the point Q is in the triangle 
CDA, 

    \vec{AQ} = s_{QE} \times \vec{AC} + t_{QE} \times \vec{AD},    (7) 

where s_QE and t_QE are weighting values. Note that the values s_QE and t_QE are 
generally different from s_Q and t_Q of Eq. 5, respectively. Taking account of the 
x-components and y-components, the matrix system for (s_QE, t_QE) is 

    \begin{pmatrix} x_{CE} - x_{AE} & x_{DE} - x_{AE} \\ y_{CE} - y_{AE} & y_{DE} - y_{AE} \end{pmatrix} 
    \begin{pmatrix} s_{QE} \\ t_{QE} \end{pmatrix} 
    = \begin{pmatrix} x_{QE} - x_{AE} \\ y_{QE} - y_{AE} \end{pmatrix},    (8) 

where (x_CE, y_CE), (x_DE, y_DE), (x_AE, y_AE), and (x_QE, y_QE) are the coordinates 
of points C, D, A, and Q in the VRC, respectively. For efficient calculation, the 
coordinates of the nodal data component in the tetrahedral model should be 
stored in both the NPC and the VRC. If the triangle is parallel to the z-axis, the 
y-components and z-components, or the z-components and x-components, could 
be considered instead, according to whether the triangle is parallel to the x-axis 
or y-axis, respectively. The calculated weighting values (s_QE, t_QE) are used for 
(s, t) in Eq. 6. 

In the approximate technique, the weighting values (s, t) are approximated 
by using the values of (s_Q, t_Q) calculated in the first step. No calculation for the 
inverse transformation is required in this technique. The technique ensures C^0 
continuity across edges shared with neighboring triangles. However, it is not 
accurate, because it depends on the viewing parameters, such as the location 
of the viewing point. For example, when a relatively large triangle is rendered 
by this technique, a noticeable artifact appears in the generated image. Since 
we can assume that the tetrahedral cells handled are relatively small, we do not 
expect such an artifact to cause any problems. This expectation is confirmed 
by two images generated by using the two interpolation techniques given in the 
section on the performance evaluation. 

Third step: The third step is to interpolate data at a sampling point, 
labeled X, along the ray by using the data (S_P, S_Q), as shown in Figure 4. We 
may assume that S_P has been previously interpolated. Using an interior division 
ratio r, defined as 

    r = \frac{PX}{PQ},    (9) 

we can calculate the data S_X at the sampling point X as 

    S_X = r \times S_Q + (1 - r) \times S_P,    (10) 

because they are linearly distributed along the segment PQ. When interpolating 
the data at another sampling point along the ray in the same cell, we have only 
to repeat the third step. 
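
The second and third steps reduce to a handful of multiplications, as the 
following sketch shows. It is an illustration under assumptions: the function names 
are hypothetical, and the sampling points are placed at the centers of equally 
sized sub-segments of PQ, whereas an actual renderer would place them according to 
the global sampling interval along the ray. 

```python
def interpolate_on_face(s, t, s_c, s_d, s_a):
    """Second step, Eq. 6: data at the exit point Q on triangle CDA."""
    return s * s_c + t * s_d + (1.0 - s - t) * s_a


def sample_along_segment(s_p, s_q, n_samples):
    """Third step, Eqs. 9 and 10: data at n_samples points between P and Q."""
    values = []
    for k in range(n_samples):
        r = (k + 0.5) / n_samples            # interior division ratio PX / PQ
        values.append(r * s_q + (1.0 - r) * s_p)
    return values
```

Once S_P and S_Q are known, every additional sampling point in the same cell 
costs one evaluation of Eq. 10, independent of the four node values of the cell. 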




4 Performance Evaluation 

4.1 Cost Functions 

The algorithm for calculating the final image is given in Figure 5. 

Calculate the priority of exterior faces 
For an exterior face 
    1. Scan-convert the face 
    2. Calculate a pixel position by DDA 
    For a viewing ray that is cast from the position 
        a. Calculate the color and opacity 
        b. Composite them into a final brightness 
        c. Find the first cell 
        For a cell 
            1) Search for the exit triangle 
            2) Calculate the weighting values (s, t) 
            3) Interpolate the scalar and gradient 
            For a sampling point 
                a) Calculate the ratio r 
                b) Interpolate the scalar and gradient 
                c) Calculate the color and opacity 
                d) Composite them into a final brightness 
            4) Search for the next cell 
        d. Search for the exterior face 
        e. Calculate the color and opacity 
        f. Composite them into a final brightness 

Fig. 5. Fast algorithm for the image order approach 

We divide the total cost, which excludes the cost of depth sorting of exterior 
faces, into three parts. The cost of depth sorting will be discussed in the next 
section. The first part, t_re, is the cost of calculating the intersections of viewing 
rays and front- or back-facing exterior faces, and the colors and opacities at 
the intersections. The second part is the cost of calculating the intersections 
of viewing rays and interior faces, and the scalar values and scalar gradients 
at the intersections (the first and second steps of the linear sampling method). 
An interior face is one that is shared by another cell in a volume. This cost 
is denoted as t_c^{ac} when the accurate technique is employed as the second step, 
and t_c^{ap} when the approximate technique is employed. The third part, t_s, is the 
cost of calculating the colors and opacities at sampling points (the third step of 
the linear sampling method). Consequently, the total cost functions (T_ac, T_ap) 
of this rendering algorithm are estimated as 

    T_{ac} = N_{re} \times t_{re} + N_c \times t_c^{ac} + N_s \times t_s,    (11) 
    T_{ap} = N_{re} \times t_{re} + N_c \times t_c^{ap} + N_s \times t_s,    (12) 

where 

— N_re is the number of times that viewing rays intersect front- or back-facing 
exterior faces. 

— N_c is the number of times that viewing rays visit tetrahedral cells. 

— N_s is the number of sampling points along viewing rays. 

To demonstrate the efficiency of the algorithm, let us compare it with another 
algorithm, which has the following features: 

— The linear sampling part is the same. (The accurate technique is employed 
as the second step.) 

— The calculation of ray-face intersection uses bounding volumes whose shape 
is a frustum of a quadrangular pyramid, so that the intersection tests can 
be reduced to simple comparisons against the limits of the volumes on a 
pixel plane. (Exterior faces are projected onto the plane, as in the proposed 
algorithm.) 

In this algorithm, given in Figure 6, a fourth cost, t_r, is added. This is for 
testing the bounding volume for intersections. The first cost, t_re, in this algorithm 
includes the cost of searching among all the candidate faces whose bounding 
volumes intersect a viewing ray for the face that the ray intersects. If the cost is 
denoted as t_re', the cost function (T_bv) is 

    T_{bv} = N_{re} \times t_{re}' + N_c \times t_c^{ac} + N_s \times t_s + N_r \times N_e \times t_r,    (13) 

where 

— N_e is the number of front- or back-facing exterior faces. 

— N_r is the number of viewing rays that hit the volume. N_r < N_re, because a 
viewing ray may intersect more than one front- or back-facing exterior face. 

The main difference between the two algorithms is the object that controls 
the outermost loop: an exterior face for the former and a viewing ray for the 
latter. A ray-by-ray approach for intersection tests such as that in the latter 
algorithm requires a kind of “exhaustive search” for intersections. Indeed, the 
cost t_r of one test is trivial, because the test is a simple comparison, but the 
number of tests, N_r x N_e, is very large. Therefore, the product N_r x N_e x t_r 
is not negligible. A face-by-face approach such as the former algorithm is more 
efficient, because it makes use of coherence to avoid unnecessary intersection 
tests. 

5 Result and Discussion 

To demonstrate the efficiency of our algorithm, four irregular data sets of CFD 
results were rendered. These data sets, which involve cavities or holes, are cat- 
egorized as nonconvex meshes. All the images were computed on an IBM 3090 
at a resolution of 512 x 512 pixels. 

Table 1 contains statistics on the test images. 

From the values of T_ac in four datasets, the cost components of the total cost 
function in Eq. 11 are calculated as 

    t_{re} = 3.98 \times 10^{-4}    (14) 
    t_c^{ac} = 6.77 \times 10^{-5}    (15) 
    t_s = 1.29 \times 10^{-5}.    (16) 

Calculate the priority of exterior faces 
Calculate the bounding volumes of the faces 
For a viewing ray 
    For an exterior face 
        1. Find the bounding volumes that the ray hits 
    For an exterior face surrounded by the volume 
        1. Find the exterior faces that the ray hits 
    For an exterior face that the ray hits 
        a. Calculate the color and opacity 
        b. Composite them into a final brightness 
        c. Find the first cell 
        For a cell 
            1) Search for the exit triangle 
            2) Calculate the weighting values (s, t) 
            3) Interpolate the scalar and gradient 
            For a sampling point 
                a) Calculate the ratio r 
                b) Interpolate the scalar and gradient 
                c) Calculate the color and opacity 
                d) Composite them into a final brightness 
            4) Search for the next cell 
        d. Search for the exterior face 
        e. Calculate the color and opacity 
        f. Composite them into a final brightness 

Fig. 6. Conventional algorithm for the image order approach 



Table 1. Statistics on benchmark tests for the image order approach 



Dataset name           no. 1      no. 2      no. 3       no. 4 
N                      10167      61680     372500      557868 
N_r                    89694     165677     101107      127616 
N_re                   91740     194408     105018      132190 
N_e                      888       2716       6023        8839 
N_c                  2001800    7364650    6842093    11782659 
N_s                  2794063    7475567    2828874     4012530 
Total time (seconds) 
    T_bv               520.1     2125.4     2374.5      4188.1 
    T_ac               211.5      661.3      539.0       906.7 
    T_ap               151.0      483.6      359.7       606.9 
T_sort (seconds)         0.3        2.6       14.2        29.2 
The validity of these costs can be confirmed by using other datasets. By 
using the values of T_bv in four datasets and the calculated costs (t_c^{ac}, t_s), 
the costs, t_r and t_re', of Eq. 13 are calculated as 

    t_r = 2.80 \times 10^{-6}    (17) 
    t_{re}' = 1.49 \times 10^{-3}    (18) 

The value of t_re' is more than that of t_re. This is attributed to the cost 
of checking the exterior faces that are not intersected by a ray but that are 
surrounded by the bounding volumes hit by the ray. The value of t_s is less than 
that of t_c^{ac}. This means that the cost at sampling points has less effect on the 
total cost than that at cell intersections. In other words, the total cost is not 
doubled even if twice as many sampling points are placed in such a way as to 
increase the image quality. By using the values of T_ap in four datasets, and the 
calculated costs, the cost, t_c^{ap}, in Eq. 12 is calculated as 

    t_c^{ap} = 4.12 \times 10^{-5}.    (19) 

This cost is about forty percent lower than that of the accurate technique. 
Moreover, we cannot observe any noticeable difference between the two images, which 
were generated by using the two interpolation techniques, because the size of 
each tetrahedral cell is very small in relation to the extent of the image plane. 
Therefore, an approximate technique is preferable as the second step of the linear 
sampling method. 
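
As a consistency check, the cost functions of Eqs. 11 to 13 can be evaluated with 
the statistics of Table 1 and compared with the measured total times. The sketch 
below is illustrative only. The unit costs are taken from Eqs. 14 to 19 as read 
from the text; their exponents are treated here as assumptions, chosen because 
they reproduce the measured times, and all variable and function names are 
hypothetical. 

```python
# Statistics from Table 1, one entry per dataset (no. 1 to no. 4).
N_r = [89694, 165677, 101107, 127616]
N_re = [91740, 194408, 105018, 132190]
N_e = [888, 2716, 6023, 8839]
N_c = [2001800, 7364650, 6842093, 11782659]
N_s = [2794063, 7475567, 2828874, 4012530]

# Measured total times in seconds, also from Table 1.
measured = {
    "T_bv": [520.1, 2125.4, 2374.5, 4188.1],
    "T_ac": [211.5, 661.3, 539.0, 906.7],
    "T_ap": [151.0, 483.6, 359.7, 606.9],
}

# Unit costs in seconds (Eqs. 14-19); the exponents are assumptions.
t_re, t_c_ac, t_s = 3.98e-4, 6.77e-5, 1.29e-5
t_r, t_re_bv = 2.80e-6, 1.49e-3
t_c_ap = 4.12e-5


def cost_scan_conversion(i, t_c):
    """Eqs. 11 and 12: face-by-face algorithm with cell cost t_c."""
    return N_re[i] * t_re + N_c[i] * t_c + N_s[i] * t_s


def cost_bounding_volume(i):
    """Eq. 13: ray-by-ray algorithm with bounding-volume tests."""
    return (N_re[i] * t_re_bv + N_c[i] * t_c_ac + N_s[i] * t_s
            + N_r[i] * N_e[i] * t_r)


estimated = {
    "T_bv": [cost_bounding_volume(i) for i in range(4)],
    "T_ac": [cost_scan_conversion(i, t_c_ac) for i in range(4)],
    "T_ap": [cost_scan_conversion(i, t_c_ap) for i in range(4)],
}

for name in ("T_bv", "T_ac", "T_ap"):
    for i in range(4):
        print(f"{name} no. {i + 1}: estimated {estimated[name][i]:.1f} s, "
              f"measured {measured[name][i]:.1f} s")
```

With these values the estimates stay within a few percent of the measured times 
for all four datasets, and the term N_r x N_e x t_r accounts for most of the 
estimated T_bv of the larger datasets. 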

In the first step of the linear sampling method, we tested the ray individually 
against each candidate triangle to see if the ray intersected with the triangle. On 
the other hand, Hong treats the candidate triangles together as a group and uses 
a ray-crossing technique to find which triangles intersect with the ray [Hong99]. 
To compare our technique with Hong’s technique, we measured the cost of finding 
an intersected triangle and calculating the weighting values, s_Q and t_Q, using a 
single tetrahedral cell mapped onto an image plane. To allow for the finite precision 
of the measurement function, we repeated the first step thirty thousand times. Given 
an entry point to the tetrahedral cell, Hong’s technique calculates the ray-crossing 
number for each of the six triangle edges and the weighting values for a triangle 
the ray-crossing number of which is exactly one. 



Table 2. Time cost for the first step (seconds) 



No. of checked triangles       1       2       3 
Ours                        0.46    0.86    1.27 
Hong’s                      1.67    1.68    1.70 



The cost of our technique depends on the location of the entry point in
the tetrahedral cell, whereas that of Hong’s technique is independent of
the location. In the worst case, our technique calculates the weighting values
three times. Therefore, we consider three cases in which our technique checks
one, two, and three triangles. Table 2 confirms that our technique is faster than
Hong’s technique even in the worst case (1.27 s versus 1.70 s).
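
The dependence of the cost on the number of checked triangles can be illustrated with a minimal sketch. It assumes that the candidate triangles have already been projected onto the image plane and tests them one at a time, stopping at the first hit. The function names and the barycentric form of the weighting values are illustrative assumptions, not the authors' implementation.

```python
def barycentric_weights(p, a, b, c):
    """Return (s, t) such that p = a + s*(b - a) + t*(c - a), or None if degenerate."""
    v0 = (b[0] - a[0], b[1] - a[1])
    v1 = (c[0] - a[0], c[1] - a[1])
    v2 = (p[0] - a[0], p[1] - a[1])
    det = v0[0] * v1[1] - v1[0] * v0[1]
    if det == 0.0:
        return None  # zero-area triangle in the image plane
    # Cramer's rule for the 2x2 system [v0 v1] (s, t)^T = v2
    s = (v2[0] * v1[1] - v1[0] * v2[1]) / det
    t = (v0[0] * v2[1] - v2[0] * v0[1]) / det
    return s, t


def find_hit_triangle(p, candidates):
    """Test candidates in order; return (index, s, t, triangles_checked) or None."""
    for checked, (a, b, c) in enumerate(candidates, start=1):
        weights = barycentric_weights(p, a, b, c)
        if weights is None:
            continue
        s, t = weights
        if s >= 0.0 and t >= 0.0 and s + t <= 1.0:
            return checked - 1, s, t, checked
    return None
```

The last element of the returned tuple is the number of triangles examined, which is the quantity that varies across the "Ours" row of Table 2.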

The total time costs of dataset no. 4 were calculated by substituting these 
calculated costs and the statistics in Table 1 into Eqs. 11-13, as shown in Figure 7. 
The results of the calculation show the following: 




Fig. 7. Improvement in the performance when dataset no. 4 is used 



1. If we adopt the intersection test based on bounding volumes, most of the 
total time cost is for the calculation of ray-face intersection tests. 

2. If we adopt the intersection test based on scan conversion, the time required 
for the tests is very greatly reduced. 

3. Interpolation at intersections of rays and cells still occupies a large part of 
the total computation time, although employment of an approximate inter- 
polation technique reduces it significantly. 

The total time cost is O(N_c) in a rough estimate from Eqs. 11 and 12. The
number N_c can be described as

Nc = N^c'" X (20) 

where one factor is the average number of volume cells that a viewing ray visits
from the entry point until the exit point. This number is estimated as O(N^{1/3}),
where N is the number of volume cells. Therefore, the performance of our method,
except for depth sorting of front- or back-facing exterior faces, is approximately
O(N^{1/3}).

Our algorithm requires depth sorting of front- or back-facing exterior faces 
as a pre-process. In the above test, an approximate priority was calculated by 
simply sorting the z-coordinates of the face-centroids. The priority resulting 
from the approximation may be incorrect and may lead to noticeable artifacts 
in the image. However, even this approximate sorting can generate a correct 
priority in many cases, because, unlike interior faces, the front- or back-facing 
exterior faces that overlap on a screen are likely to be sufficiently distant from 
each other for the sorting. The time cost, T_sort, of the sorting is small for the
test datasets, as shown in Table 1. For larger datasets, this cost might not be
negligible, because it is O(N_f^2). In order to improve the performance, we can
use a more efficient sorting algorithm such as radix sorting, whose performance
is estimated as O(N_f) [Suth74]. For correct sorting, we could employ the list-priority
algorithm proposed by Newell [Newe72]. Although an exterior face may
be split in this case, our algorithm can be made to work simply by replacing 
a component of an original exterior face with components of the split exterior 
faces in the tetrahedral model. 
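
The approximate priority described above can be sketched as follows. This is a minimal illustration that assumes each exterior face is given as a sequence of (x, y, z) vertices and that larger z means farther from the viewer; both the data layout and the function names are hypothetical. It uses Python's built-in comparison sort rather than the radix sort mentioned in the text.

```python
def centroid_z(face):
    """Mean z-coordinate of a face given as a sequence of (x, y, z) vertices."""
    return sum(vertex[2] for vertex in face) / len(face)


def approximate_priority(faces, far_to_near=True):
    """Return face indices ordered by centroid depth.

    Assumes larger z is farther from the viewer; set far_to_near=False
    for the opposite traversal order.
    """
    return sorted(range(len(faces)),
                  key=lambda i: centroid_z(faces[i]),
                  reverse=far_to_near)
```

Because only centroids are compared, two overlapping faces whose depth ranges interleave can still be ordered incorrectly, which is the source of the artifacts noted in the text.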

6 Summary 

Although they can be expensive, volume ray tracing techniques are very useful for 
rendering irregular volumes with cavities or voids. The bottleneck of the overall 
process, which was confirmed by using some irregular data sets, is the ray-face 
intersection testing. To solve the problem, we incorporate a projection approach 
into the ray-casting process. Moreover, our technique can be used to speed up a 
conversion process to reconstruct a voxel dataset from a given irregular volume 
dataset.

Abstract. We propose a new design concept for MEMS in which the
deflection of a micro-membrane is controlled through its thickness
distribution so as to realize a prescribed design. As an
example, we treat a micro air pump that comprises a micro-membrane
actuated by an electrostatic force. The deflection of the membrane
both influences and is influenced by the air pressure and the
electrostatic field, which makes this a highly complicated system. To find a
proper thickness distribution, we use the genetic algorithm, which is
appropriate for reducing the search space of solutions.



1 Introduction 

Recently, several types of micro-electro-mechanical systems (MEMS), e.g., micro
optical mirrors, micro valves, micro pumps and micro actuators, have been developed
and used in practice (1-3). These devices are actuated by electrostatic force and/or 
fluid pressure. In developing the MEMS, the shape and material for the parts should 
be designed properly so that their motion would provide a desired function. We must 
consider that the dynamic response of the micro-membrane actuator is strongly 
affected by its shape and its mechanical properties. When a membrane is subjected to 
air pressure, electrostatic force or other external force, it would deflect depending on 
its material properties and boundary conditions. The deflection of the membrane in 
turn influences the airflow or pressure and electrostatic force. If the membrane is of a 
constant and uniform thickness, a particular external force gives a particular 
deflection to the membrane. If, however, the thickness is not uniform, the deflection 
may be varied. In other words, the deflection would be controlled by its thickness 
distribution. In fabricating a micro-membrane by physical vapor deposition or 
chemical vapor deposition, it is easy to distribute the thickness of the thin films. It 
would thus be possible to develop a new mechanical structure for the desired 
performance. 

However, this is a highly complicated system and we should develop a new 
scheme for obtaining an optimal shape or thickness distribution of the membrane. In 
our previous study, we proposed a method of analyzing the dynamic response of the 
micro-membrane actuated by an electric field (4).

The purpose of this study is to develop an optimization method for calculating the
thickness distribution of a micro-membrane actuated by an electrostatic force. Since
the genetic algorithm is appropriate for reducing the search space of solutions, it is
employed to find an optimum shape (5). As an example, the thickness distribution
of the micro-membrane of a micro air pump actuated by an electrostatic force is
optimized to realize a prescribed response. We present the numerical method for
optimization based on the genetic algorithm and discuss the convergence of the
solution and the efficacy of the developed method.



2 Dynamic Response of Membrane 

As an example of MEMS actuated by an electrostatic force, we treat a model of a
micro air pump as shown in Fig. 1. It comprises a micro-membrane, which serves as the
upper electrode, and a base, which serves as the lower electrode. The micro-pump is
400 µm long, 200 µm wide and 20 µm high. In this pump, the outlet is located not at
the center but 150 µm away from it. The air is pumped out according to the deflection
of the micro-membrane actuated by an electrostatic force.

Pulled by the electrostatic force, the micro membrane undergoes deformation in 
both in-plane and out-of-plane directions. Therefore, stress equilibrium equations of 
in-plane stress and bending are used. As the micro membrane deflects, its distance 
from the lower electrode changes. Thus, the change of actuation force due to the 
micro-membrane deflection should be considered. We used the following equations 
for simulation. 

1) Electric field

The Laplace equation for the electrostatic field, assuming that no distributed charge
exists between the upper and lower electrodes, is written as

$$\nabla^2 \phi = 0$$   [1]

where φ is the electrostatic potential. The strength of the electrostatic field, E, is
given by

$$E = -\nabla \phi = -\frac{\partial \phi}{\partial n}$$   [2]

where n is the outward normal to the surface of the area concerned. The electrostatic
stress is written as

$$f_e = \frac{1}{2}\,\varepsilon E^2$$   [3]

where ε is the dielectric constant of the fluid.

Fig. 1 Schematic view of micro pump (labels: micro membrane electrode, center, inlet, outlet; the outlet is 150 µm from the center)
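
To give a feel for the magnitude of the electrostatic load, the following sketch evaluates it for a flat, undeflected membrane. It assumes a uniform parallel-plate field E = V/h and the standard expression f_e = εE²/2 for the stress on a conductor surface. The paper itself solves the Laplace equation by the boundary element method, so this is only an order-of-magnitude check. The voltage (160 V) and gap (10 µm) are those reported for the fabricated actuator in Section 4, and the permittivity is taken from Table 1; the function names are illustrative.

```python
EPSILON = 8.854e-12  # permittivity in F/m (Table 1)


def parallel_plate_field(voltage, gap):
    """Uniform-field approximation E = V / h between two flat electrodes (V/m)."""
    return voltage / gap


def electrostatic_stress(field, epsilon=EPSILON):
    """Electrostatic stress f_e = 0.5 * epsilon * E**2 (Pa)."""
    return 0.5 * epsilon * field ** 2


field = parallel_plate_field(voltage=160.0, gap=10e-6)
stress = electrostatic_stress(field)
print(f"E = {field:.3e} V/m, f_e = {stress:.1f} Pa")
```

Under these assumptions the load is on the order of 1 kPa, roughly one percent of the atmospheric pressure listed in Table 1. It grows as the gap closes, which is why the field must be recomputed as the membrane deflects.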



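2) Stress equilibrium

The stress equilibrium equation for in-plane deformation of the membrane is
written as

$$\frac{\partial \sigma_x}{\partial x} + \frac{\partial \tau_{xy}}{\partial y} + f_x = 0, \qquad \frac{\partial \tau_{xy}}{\partial x} + \frac{\partial \sigma_y}{\partial y} + f_y = 0$$   [4]

where σ_x, σ_y and τ_xy are the in-plane stress components, and f_x and f_y are body
forces in the x and y directions, respectively.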

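The membrane is bent by the dynamic pressure of the fluid and the electrostatic force.
The stress equilibrium equation for bending is expressed by

$$D_{xx}\frac{\partial^4 w}{\partial x^4} + 2\left(D_{xy} + 2D_{ss}\right)\frac{\partial^4 w}{\partial x^2 \partial y^2} + D_{yy}\frac{\partial^4 w}{\partial y^4} - t_m\sigma_x\frac{\partial^2 w}{\partial x^2} - t_m\sigma_y\frac{\partial^2 w}{\partial y^2} = p - p_a + f_e + t_m\rho\,\frac{\partial^2 w}{\partial t^2}$$   [5]

with

$$D_{xx} = \frac{E_x t_m^3}{12(1-\nu_x\nu_y)}, \quad D_{yy} = \frac{E_y t_m^3}{12(1-\nu_x\nu_y)}, \quad D_{xy} = \frac{(\nu_y E_x + \nu_x E_y)\,t_m^3}{24(1-\nu_x\nu_y)}, \quad D_{ss} = \frac{G\,t_m^3}{12}, \quad G = \frac{E_x E_y}{(1+2\nu_y)E_x + (1+2\nu_x)E_y}$$

where w is the membrane deflection; D_xx, D_xy, D_yy and D_ss are the flexural rigidities; t_m is the
thickness of the micro-membrane; p is the pressure of the fluid; p_a is the atmospheric pressure; ρ is the
density of the micro-membrane; f_e is the electrostatic stress; E_x and E_y are Young’s moduli of the
micro-membrane in the x and y directions, respectively; and ν_x and ν_y are Poisson’s ratios.
Since the thin film is deposited by physical and/or chemical vapor deposition, the
mechanical properties would be anisotropic. However, we assumed the material to be
isotropic in the present simulation.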
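
Since the simulation assumes an isotropic material, the effect of thickness on stiffness can be checked with the standard isotropic plate rigidity D = E t³ / (12(1 − ν²)). The sketch below evaluates it for the two element thicknesses used later, with the silicon constants of Table 1. It is a worked example under that isotropic assumption, and the function name is illustrative.

```python
def flexural_rigidity(young_modulus, thickness, poisson_ratio):
    """Isotropic plate rigidity D = E t^3 / (12 (1 - nu^2)) in N*m."""
    return young_modulus * thickness ** 3 / (12.0 * (1.0 - poisson_ratio ** 2))


E_SI = 150e9  # Young's modulus in Pa (Table 1: 150 GPa)
NU = 0.3      # Poisson's ratio (Table 1)

d_thin = flexural_rigidity(E_SI, 2e-6, NU)   # 2 um element
d_thick = flexural_rigidity(E_SI, 3e-6, NU)  # 3 um element
print(d_thin, d_thick, d_thick / d_thin)
```

Because the rigidity scales with the cube of the thickness, a 3 µm element is (3/2)³ = 3.375 times stiffer in bending than a 2 µm element, which is what allows a two-level thickness pattern to reshape the deflection.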

3) Fluid equation 

The fluid, that is, the air in the pump, is squeezed out by the micro-membrane
deflection. To simulate the fluid flow and the pressure distribution in the micro-pump
accurately, it is necessary to consider the fluid flow, the pressure loss and the blowout of
the fluid at the outlet. To do this, the Navier-Stokes equation should be solved.
However, considering that the aspect ratio of the micro-pump cavity is more than ten
and the height is on the order of 20 µm, we may estimate the performance of the
micro-pump from the pressure distribution on the deflected micro-membrane and
from the volume change of the pump cavity without taking account of these factors.
Therefore, we use the modified Reynolds equation as the fluid equation. The outlet is
a hypothetical one without a through hole.

The compressible fluid pressure is expressed by the modified Reynolds equation,
which accounts for slip on the material surface, as

$$\frac{\partial}{\partial x}\left[\left(1 + \frac{6\lambda_a}{h}\right) p h^3 \frac{\partial p}{\partial x}\right] + \frac{\partial}{\partial y}\left[\left(1 + \frac{6\lambda_a}{h}\right) p h^3 \frac{\partial p}{\partial y}\right] = 6\mu\left[V_x\frac{\partial (p h)}{\partial x} + V_y\frac{\partial (p h)}{\partial y} + 2\frac{\partial (p h)}{\partial t}\right]$$   [6]

where h is the distance between the micro-membrane and the lower electrode, λ_a is
the molecular mean free path of the air, t refers to time, V_x and V_y are the velocities of the
surface of the micro-membrane in the x and y directions, respectively, and μ is the viscosity
of the fluid. h is given by h = h_0 + w, where h_0 is the initial gap between the micro-membrane
and the lower electrode. V_x and V_y are calculated from the velocity of the
micro-membrane deflection.

Equations [4], [5] and [6] should be solved simultaneously to analyze the 
deflection of the micro-membrane (4). The distance between the micro-membrane and 
the lower electrode in the present case is large enough so that the effect of molecular 
mean free path of the air may be negligibly small. 

In the coupled analysis, first, the electric field equation is solved by the boundary 
element method. Second, derivatives with respect to time involved in equations [5] 
and [6] are calculated by the implicit method. Finally, the stress equilibrium equations 
[4] and [5] and the modified Reynolds equation [6] are solved until the deflection w 
becomes unchanged. This calculation was carried out iteratively. According to the 
deflection of the micro-membrane, the boundary elements of the electrostatic field are 
modified. 
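
The iterative coupling described above can be sketched as a fixed-point loop. The code below is a minimal illustration: the two solver callbacks stand in for the boundary element field solve and the structural and fluid solve, and the toy versions supplied here (a one-degree-of-freedom membrane with a parallel-plate field and a linear spring) are assumptions made only so that the loop can run. All names and parameter values other than the permittivity and the 10 µm gap are hypothetical.

```python
EPS = 8.854e-12  # permittivity in F/m (Table 1)


def coupled_analysis(load_from_gap, deflection_from_load, w0=0.0,
                     tol=1e-15, max_iter=100):
    """Alternate the field and structural solves until w stops changing."""
    w = w0
    for iteration in range(1, max_iter + 1):
        load = load_from_gap(w)              # field solve on the current geometry
        w_new = deflection_from_load(load)   # structural (and fluid) solve
        if abs(w_new - w) < tol:
            return w_new, iteration
        w = w_new
    raise RuntimeError("coupled analysis did not converge")


# Toy stand-ins (assumptions): w > 0 means deflection toward the lower
# electrode, so the gap is H0 - w; K is a spring stiffness per unit area.
H0, VOLTS, K = 10e-6, 50.0, 1e9


def toy_load(w):
    return 0.5 * EPS * (VOLTS / (H0 - w)) ** 2


def toy_deflection(load):
    return load / K


w_final, n_iter = coupled_analysis(toy_load, toy_deflection)
print(w_final, n_iter)
```

The loop stops when successive deflections agree to within the tolerance, mirroring the criterion "until the deflection w becomes unchanged". In the toy model the load rises as the gap narrows, so each pass slightly increases the deflection until it settles.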



3 Coding and De-coding of Membrane Thickness for 
Optimization 

We use the genetic algorithm to find an optimum thickness distribution of the
micro-membrane so that the pressure of the fluid in the vicinity of the outlet becomes
a maximum. In the present method, the membrane consists of elements with two
different thicknesses.

In the genetic algorithm, the thickness of each part of the micro-membrane is coded as 0
or 1. As an example, a coded gene is as follows.

111000001100001111000001100001111000001100001111000001100001111000
001100001111000001100001111

Each component of the gene refers to the thickness of the membrane of the finite
element; ‘1’ means t_1 and ‘0’ means t_2, as shown in Fig. 2. Each group of four
neighboring meshes has the same thickness.
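
Decoding such a gene into a mesh thickness map might look like the sketch below. It assumes that "neighboring four meshes" means a 2 × 2 block of finite-element meshes per bit and that the blocks are laid out row by row; both are assumptions for illustration, as are the function and parameter names.

```python
def decode_gene(gene, t1, t2, blocks_per_row):
    """Map a bit string to a grid of mesh thicknesses.

    Each bit sets one 2x2 block of meshes: '1' -> t1, '0' -> t2.
    The row-major block layout is an assumption for illustration.
    """
    if len(gene) % blocks_per_row != 0:
        raise ValueError("gene length must be a multiple of blocks_per_row")
    grid = []
    for start in range(0, len(gene), blocks_per_row):
        row = []
        for bit in gene[start:start + blocks_per_row]:
            thickness = t1 if bit == "1" else t2
            row.extend([thickness, thickness])  # block is two meshes wide
        grid.append(row)
        grid.append(list(row))                  # and two meshes tall
    return grid
```

For example, `decode_gene("1001", 2.0, 3.0, blocks_per_row=2)` yields a 4 × 4 grid whose upper-left and lower-right quadrants are 2.0 and whose other two quadrants are 3.0.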

For optimization by the genetic algorithm, the fitness of every individual in the
population must be calculated and evaluated in each generation. The fitness function
should be defined so that it properly expresses the characteristics of the phenomenon to be
optimized. As a demonstration of the proposed method, we optimize the thickness
distribution of the micro-membrane used for the micro air pump shown in Fig. 1. If
the thickness of the micro-membrane actuated by the electrostatic force is uniform,
the deflection of the micro-membrane is the largest at the center. The fluid pressure at
the center also becomes a maximum, while the pressure at the outlet is low.
Therefore, the performance of a micro-pump composed of a micro-membrane of
uniform thickness may be low. It is preferable to move the maximum deflection point
by means of the thickness distribution of the micro-membrane. Since the pressure
distribution depends on the distance between the micro-membrane and the lower
electrode, we utilized the distance at the outlet as a fitness function. Further, to
increase the flow at the outlet, we considered the volume change in the pump cavity
due to the micro-membrane deflection. We define the fitness function f as

where (x_h, y_h) is the outlet position, (x_p, y_p) the maximum deflection point, (x_c, y_c) the
center of the membrane, V_def the internal volume of the micro-pump cavity after the
micro-membrane deflection, and V_init the initial volume of the pump cavity. In this
case the lowest possible value of the fitness is the goal.

We used one-point crossover, mutation and an elite strategy. The crossover
rate was 0.9 and the mutation rate was 0.005. We used the tournament method for
selection with a population of 20 individuals up to the 40th generation (5). In each generation, the
micro-membrane deflection was calculated by the coupled analysis for the 20
individuals. The thickness distribution of each individual for the following
generation was calculated from the results of the previous one.
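
The search loop with these settings can be sketched as follows. This is a generic genetic algorithm under stated assumptions, not the authors' code: the tournament size of two, the bit-flip form of mutation and the way the elite is carried over are choices made for illustration, and the fitness function is passed in by the caller because the real one requires the coupled analysis. Lower fitness is better, as in the text.

```python
import random


def tournament(pop, fitness, rng, size=2):
    """Pick the fittest (lowest fitness) of `size` randomly drawn individuals."""
    return min(rng.sample(pop, size), key=fitness)


def one_point_crossover(a, b, rng):
    """Swap the tails of two bit strings at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(gene, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(("1" if bit == "0" else "0") if rng.random() < rate else bit
                   for bit in gene)


def run_ga(fitness, gene_len, pop_size=20, generations=40,
           crossover_rate=0.9, mutation_rate=0.005, seed=0):
    rng = random.Random(seed)
    pop = ["".join(rng.choice("01") for _ in range(gene_len))
           for _ in range(pop_size)]
    history = []                               # best fitness per generation
    for _ in range(generations):
        elite = min(pop, key=fitness)
        history.append(fitness(elite))
        children = [elite]                     # elite strategy: keep the best
        while len(children) < pop_size:
            a = tournament(pop, fitness, rng)
            b = tournament(pop, fitness, rng)
            if rng.random() < crossover_rate:
                a, b = one_point_crossover(a, b, rng)
            children.extend([mutate(a, mutation_rate, rng),
                             mutate(b, mutation_rate, rng)])
        pop = children[:pop_size]
    return min(pop, key=fitness), history
```

Because the best individual is always copied unchanged into the next generation, the best fitness can never get worse from one generation to the next, which matches the monotone decrease of the best-fitness curve reported in Section 4.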



4 Results and Discussion 

The material of the micro-membrane is silicon. The material constants for the calculation
are summarized in Table 1. In the calculation, the outlet is assumed to be located
150 µm from the center of the micro-pump, or 50 µm from the side wall. As a
first demonstration of optimization, 2 µm (t_1) and 3 µm (t_2) thick elements were
distributed so as to move the point of maximum deflection to the outlet
and to increase the flow volume at the same time. Figure 3 shows the thickness
distribution and deflection of the membrane of the initial population. Since the
thickness is distributed randomly, the deflection of the membrane is a maximum at
the center. Therefore, the pressure in the vicinity of the outlet is lower than at the
center. The fitness value decreases with increasing generation, as
shown in Fig. 4. After the 15th generation, the best fitness among the 20 individuals
decreases to 0.55. The average fitness also decreases to about 0.6. The best fitness
does not change after the 20th generation. However, the flow volume increases gradually with
increasing generation. This means that the thickness distribution is
optimized to move the point of maximum deflection before the 15th
generation and to increase the flow volume after the 15th.

To move the maximum deflection point still nearer to the hypothetical outlet, we
distributed elements 2 µm and 4 µm thick. Fig. 6(a) shows the thickness
distribution of the individual with the best fitness in the 40th generation. The maximum
deflection point, see Fig. 6(b), is shifted about 40 µm from the previous result with
2 µm and 3 µm thick elements. It is still about 20 µm from the outlet, that is, 70 µm
from the side wall.

In the second example, the maximum deflection point was moved toward the outlet
compared with that in the first, but the pump-out volume rate decreased by about
30%. This would be due to the fitness function, which treats both the deflection point and the
volume, as given by equation [7].

Fig. 3 Thickness and pressure distribution of initial generation: (a) thickness distribution and deflection of membrane; (b) pressure distribution

Fig. 4 Fitness and flow volume in each generation

Figure 5(a) shows the thickness distribution of the best-fitness individual in the
40th generation. As seen in Fig. 5(b), the maximum deflection point of the micro-membrane
is still 60 µm from the outlet. As shown in Fig. 5(c), the point at
which the pressure is a maximum is shifted toward the outlet. Although the
response of the membrane did not coincide with the prescribed design, the
performance of the pump may be improved. The strain distribution in the micro-membrane
is plotted in Fig. 5(d). The area with a thickness of 2 µm clearly coincides with
that where the strain is relatively large. In the genetic algorithm, an individual that
deforms more easily in the vicinity of the outlet is selected as a better one. Since the
deflection of the micro-membrane is suppressed at the center of the pump, the
membrane thickness becomes 3 µm near the center. We could not reach the
thickness distribution that provides the desired deflection. This may be because the
calculation was undertaken for the defined thickness variation, 2 µm or 3 µm, and the
given geometrical configuration of the pump.

It is obvious that the optimized results depend on the fitness function and the
assumptions in the simulation, such as the ratio of thicknesses of the membrane, the
number of thickness variations and the geometry of the pump. If we specify the geometry
of the pump, the maximum pressure at the outlet and the volume flow rate, we should
be able to obtain a proper thickness distribution, provided that an
appropriate fitness function is defined.

Table 1 Material of micro-membrane and geometrical properties for analysis

Length (µm)                               400
Width (µm)                                200
Height (µm)                               20
Thickness of the membrane (µm)            2 or 3
Mass density of the membrane (kg/m^3)     2330
Young's modulus (GPa), E_x, E_y           150
Poisson's ratio                           0.3
Viscosity (µPa·s)                         17.6
Molecular mean free path (µm)             0.064
Atmospheric pressure (MPa)                0.101
Permittivity (F/m)                        8.854E-12

To evaluate the calculated results, we fabricated a micro-membrane actuator whose
thickness distribution was optimized by the proposed method. Figure 7 shows the
fabrication process. The thickness of the silicon wafer was reduced by wet etching
with SiO2 as an etching mask. The membrane is bonded onto a glass base on
which ITO is deposited as the lower electrode. The gap between the lower
electrode and the membrane is 10 µm. A voltage of 160 V was applied between the
lower electrode and the upper electrode deposited on the backside of the silicon
micro-membrane, and the membrane deflected as plotted in Fig. 8. The small gray square
blocks indicate the thick areas of the micro-membrane and the white area indicates the thin
area. The upper part of the figure is the measured result and the lower part is the result
calculated by the proposed method, in which the hypothetical maximum deflection point is
shifted 2.5 mm from the center of the membrane. The experimental result agrees very well
with the calculated one.

5 Concluding Remarks

This study treats a new concept for designing MEMS in which the micro-membrane
deflection is controlled by the distribution of element thickness. We developed a
method that uses the genetic algorithm to optimize the thickness distribution of
a micro-membrane actuated by an electrostatic force so that it performs a desired deflection. As
an application of the developed method, we optimized the thickness distribution of
the micro-membrane of a micro air pump actuated by the electrostatic force. It was shown that the
deflection pattern depends strongly on the thickness distribution.

By the present method, the dynamic motion of a micro-membrane can be varied
and, therefore, the performance of MEMS can be optimized.

Fig. 5 Thickness distribution and deflection of the membrane with 2 µm and 3 µm thick elements in the 40th generation: (a) thickness distribution of micro-membrane; labels: maximum deflection point, outlet; (c) pressure distribution (Pa); (d) magnitude of strain in membrane

Fig. 6 Deflection of 2 µm and 4 µm thick elements in the 40th generation: (a) thickness distribution of micro-membrane; labels: outlet; axis: position (mm)

Fig. 7 Fabrication process of the micro-membrane actuator

Fig. 8 Deflection of the micro-membrane whose thickness distribution is optimized (experimental and calculated results; axis: position (mm))

Abstract. In parallel computing, a means of visualizing large-scale
datasets is increasingly in demand. To provide the necessary level of
performance, several software-based rendering systems have been
developed for general-purpose parallel architectures. We have been developing
a distributed rendering system focused on reducing network traffic when
visualizing large-scale datasets, especially 3D geometric data. This paper
describes the design of our distributed parallel rendering server (called the On
Demand Rendering System) and the possibility of applying this system to
visualization in high performance computing.



1 Introduction 

Large-scale parallel numerical simulation has become very popular in many
scientific fields. As the power and availability of general-purpose parallel computer
systems have grown, the computer graphics community has become increasingly
interested in exploiting them to support sophisticated rendering methods and complex
scenes. In order to meet the demands of parallel simulation and visualization, several
software-based renderers have been developed in recent years.

Software-based renderers are very effective, especially for the massive datasets that are
produced by large-scale scientific applications. Such datasets can easily be hundreds of
megabytes in size, and time-dependent simulations can easily increase the output further.
Sending such large output data to another workstation is not practical for
visualization. Almost all software-based renderers have APIs that can be embedded in
parallel applications to produce renderable output for clients. Fig. 1 shows the basic
architecture of ordinary software-based renderers for parallel processing.

On the other hand, we have been developing a distributed rendering system called the
“On Demand Rendering System”. This system is also a software-based renderer for
3D geometrical models. However, our rendering system focuses on gathering 3D
geometrical models (e.g., VRML) stored in distributed databases on the network and
serving them to the client as rendered images (shown in Fig. 2). In other words, the
software-based parallel renderer represented in Fig. 1 is very tightly coupled to the
architecture of one parallel or cluster computer, whereas our system is a network-based
parallel renderer.

In this paper, we describe the design and implementation of the On Demand
Rendering System, and discuss the possibility of network parallel rendering for
visualizing massive datasets.




Fig. 1. MPI based Parallel Visualization System 




Fig. 2. On Demand Rendering System architecture

2 Design of On Demand Rendering System

The On Demand Rendering System is designed as a server-client system. It is
basically an image-based rendering system. However, because the system
must run under various network and computer environments, it provides several ways
of communicating between the server and the client. Fig. 3 illustrates the architecture of the On
Demand Rendering Server framework, which consists of the following parts:



Fig. 3. A Sample Data flow of On Demand Rendering System (labels: simulation program, visualization program, On Demand Rendering)



2.1 Rendering Server 

The Rendering Server is the application that receives geometrical data from
computer disks or databases and renders it in response to the client’s requests. Three
steps are defined for rendering 3D geometrical data: “transformation” (defined
as “rendering step 0”), “rasterising” (defined as “rendering step 1”), and
“assembling” (defined as “rendering step 2”). The transformation process translates
3D geometrical information into 2D vector information with depth information (we
call it 2.5D vector information), the rasterising process translates 2.5D vector
information into 2.5D image information, and the assembling process gathers the results into
a scene. These three steps are shared between the server and the client. The Rendering
Server can be installed on distributed computers that are connected via a network.

2.2 Contents Editor

Contents Editor is the client application that receives rendering information
from multiple Rendering Servers and arranges it in 3D space. The result of the
arrangement is described in a tag-based ASCII text file named the Layout Description
File. A client can request from the server a distribution policy for the three steps. Fig. 4
illustrates the case of assembling two models. In this case, the server rendering steps are
set to “transformation and rasterization”. In short, two of the three rendering steps are
performed on the server side, and the assembling step, which compares the depth
information of the images and arranges them, is processed on the client.
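
The client-side assembling step can be illustrated by a per-pixel depth comparison. The sketch below assumes that each server delivers a 2.5D image as a pair of equally sized 2D lists (colors and Z values), that a smaller Z value is nearer to the viewer, and that uncovered pixels carry an infinite depth. The data layout and names are hypothetical and are not taken from the system's actual protocol.

```python
def assemble(layers):
    """Composite 2.5D images, keeping per pixel the sample nearest the viewer.

    Each layer is (colors, depths): two equally sized 2D lists. Smaller depth
    means nearer; float('inf') marks a pixel the layer does not cover.
    """
    colors0, depths0 = layers[0]
    out_color = [row[:] for row in colors0]
    out_depth = [row[:] for row in depths0]
    for colors, depths in layers[1:]:
        for y, depth_row in enumerate(depths):
            for x, z in enumerate(depth_row):
                if z < out_depth[y][x]:
                    out_depth[y][x] = z
                    out_color[y][x] = colors[y][x]
    return out_color, out_depth
```

Because the comparison needs only the images and their Z values, the client can combine results from any number of Rendering Servers without access to the original geometry.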
Fig. 4. An Example of Rendering Steps between Server and Client.

2.3 Contents Browser

Contents Browser interprets the Layout Description File, receives the rendering
results from the appropriate Rendering Servers, and displays them on the screen.



3 Sample Analyses 




Fig. 5. An Example of Final Rendering Image at the Client 



Table 1. Benchmark Result (columns: model; polygons [triangles]; VRML size [MByte]; server rendering step; update time [sec/frame]; server [sec]; transmission [sec]; client [sec]; size in transmission [KByte]; rows: Building, Terrain and Robot, each with server rendering steps 0 and 2)

Table 1 shows the performance measured with three kinds of geometry models, and Fig. 5
shows the models: Building, Terrain and Robot. Building consists of about 5000
triangles, and its data size in VRML is about 400 KBytes. Terrain is of medium size and
Robot is the largest. The case of no rendering on the server machine is given in the
server rendering step 0 rows. In this case, the compressed 3D geometry data is
passed to the client. In the case of server rendering step 2, most of the rendering process
is performed on the server, and the image with Z values is then passed to the client. The
image data is compressed in JPEG format. The other types of data, namely Z values and 3D
geometry data, are compressed by a run-length algorithm. The size of the compressed data is
shown in the “size in transmission” column. The final rendered image size is 500x500 pixels in
all cases. “Update time” means the update time per frame when rotating the object. “Transmission”
includes the data compression and decompression processes. The server-side program
runs on an SGI Indy (R5000), and the client-side programs run on a PC with a
Pentium II 450 MHz processor.
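
Run-length compression of a row of Z values can be sketched as below. This is the textbook scheme, shown only to illustrate why depth buffers with large uniform regions compress well; the paper does not specify its exact encoding, so the pair format and function names are assumptions.

```python
def rle_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```

A background-dominated Z row such as `[255, 255, 255, 17, 17, 255]` becomes three pairs, whereas data with few repeated neighbors gains little from this scheme.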
Fig. 6. Benchmark Result in Graph

Fig. 6 shows the same data as a bar chart. Because the 3D geometry model cannot be
compressed effectively, large data incurs a high transfer cost in the
case of server step 0. With server step 2, however, the transfer cost does not
depend on the model size. This approach, assembling the images with Z values on a
client machine, is popular in other parallel rendering systems. Our system, however, allows
the user to select the transfer policy. When an isosurface technique is used for visualization, the
number of generated polygons cannot be predicted. If it is small, the 3D isosurface can be
downloaded. If it is large, the 3D model is kept on the server and only its image is
assembled into the client images.

Once the models have been assembled, the layout can be stored in a file called the “Layout
Description File”. When the client loads this file, the assembled images are restored
by accessing the servers. The restoration time is 28 sec for three models with
server step 2 selected. This is almost the same as the sum total of the individual times, because all
server processes were running on a single machine in this test.



4 Conclusion 

We have developed a client-initiated distributed rendering system. This
system makes it possible to choose dynamically among a 3D geometry model, a 2.5D
geometry model and 2.5D image data when transmitting data from a server to a client. A
rendering benchmark was performed using models of three different data sizes,
and it showed that it is effective to change the transmission method according to
the network load and the processing capabilities of the server and the client.

Although a system in which rendering is independent of the
simulation program is limited in how far it can improve overall processing
performance, it is effective for constructing a visualization system that
makes efficient use of existing assets that were not designed for distributed
environments. To reduce the amount and the number of communications between
client and server during interactive processing, we are planning to implement more
intelligent functions in the client-side application. We expect to apply this system to
numerical computation in loosely coupled environments, for example, coupled
simulation.